### Calculate Summary Statistics for a DataFrame or Series

In [1]:
import pandas as pd

mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")
mexico_city1.head()

Unnamed: 0,property_type,department,lat-lon,area_m2,price_usd
0,house,A,"4.69,-74.048",187.0,330899.98
1,house,C,"4.695,-74.082",82.0,121555.09
2,house,A,"4.535,-75.676",235.0,219474.47
3,house,C,"4.62,-74.129",195.0,97919.38
4,house,C,"4.62,123.23",,97919.38


In [2]:
mexico_city1.describe()

Unnamed: 0,area_m2,price_usd
count,4.0,5.0
mean,174.75,173553.66
std,65.301225,101256.512276
min,82.0,97919.38
25%,160.75,97919.38
50%,191.0,121555.09
75%,205.0,219474.47
max,235.0,330899.98


Like most large datasets, this one has missing values. The `describe` method ignores missing values when it computes each column's statistics, which is why the `count` for `area_m2` above is 4 rather than 5. You can also drop rows and columns yourself and then summarize the data that remains. Here we drop the `department` column first and then drop any rows that still contain missing values. The order matters: if we dropped rows first, we could discard a row because of a missing value in a column we were going to remove anyway. The code looks like this:

In [4]:
mexico_city1 = mexico_city1.drop(["department"], axis=1)
mexico_city1 = mexico_city1.dropna(axis=0)
mexico_city1.head()

Unnamed: 0,property_type,lat-lon,area_m2,price_usd
0,house,"4.69,-74.048",187.0,330899.98
1,house,"4.695,-74.082",82.0,121555.09
2,house,"4.535,-75.676",235.0,219474.47
3,house,"4.62,-74.129",195.0,97919.38


In [5]:
mexico_city1.describe()

Unnamed: 0,area_m2,price_usd
count,4.0,4.0
mean,174.75,192462.23
std,65.301225,106240.050082
min,82.0,97919.38
25%,160.75,115646.1625
50%,191.0,170514.78
75%,205.0,247330.8475
max,235.0,330899.98
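
Before dropping anything, it can help to see where the missing values actually are. The following is a minimal sketch, not part of the original notebook: `sample` is a hand-typed stand-in for the five rows shown earlier (an assumption, so the example runs without the CSV file), and `missing_per_column` and `complete_rows` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: the five rows displayed by head() above.
sample = pd.DataFrame(
    {
        "property_type": ["house"] * 5,
        "department": ["A", "C", "A", "C", "C"],
        "area_m2": [187.0, 82.0, 235.0, 195.0, float("nan")],
        "price_usd": [330899.98, 121555.09, 219474.47, 97919.38, 97919.38],
    }
)

# Count the missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# dropna(axis=0) removes every row that has at least one missing value.
complete_rows = sample.dropna(axis=0)
print(len(sample), "rows before,", len(complete_rows), "rows after")
```

With this sample, only `area_m2` has a missing value, so dropping incomplete rows removes exactly one row.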


### Select a Series from a DataFrame

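Since the datasets we work with are so large, you might want to focus on a single column of a DataFrame. Let's load the `mexico-city-real-estate-2` dataset and examine the first few rows to find the column names. We can then select one column by name with square brackets, which returns a Series, or keep only the numeric columns with `select_dtypes`.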

In [6]:
mexico_city2 = pd.read_csv("./data/mexico-city-real-estate-2.csv")
mexico_city2.head()

Unnamed: 0,property_type,department,lat-lon,area_m2,price_usd
0,house,A,"4.69,-74.048",187.0,330899.98
1,house,C,"4.695,-74.082",82.0,121555.09
2,house,A,"4.535,-75.676",235.0,219474.47
3,house,C,"4.62,-74.129",195.0,97919.38
4,house,C,"4.62,123.23",,97919.38


In [7]:
price = mexico_city2["price_usd"]
print(price)

0    330899.98
1    121555.09
2    219474.47
3     97919.38
4     97919.38
Name: price_usd, dtype: float64


In [9]:
mexico_city_number = mexico_city2.select_dtypes(include = "number")
mexico_city_number.head()

Unnamed: 0,area_m2,price_usd
0,187.0,330899.98
1,82.0,121555.09
2,235.0,219474.47
3,195.0,97919.38
4,,97919.38
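
The difference between a Series and a one-column DataFrame is easy to miss. The sketch below is an illustration under assumptions rather than part of the original notebook: `sample` is a hand-typed copy of the numeric columns shown above, and `price_series` and `price_frame` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for the numeric columns of the dataset.
sample = pd.DataFrame(
    {
        "area_m2": [187.0, 82.0, 235.0, 195.0, float("nan")],
        "price_usd": [330899.98, 121555.09, 219474.47, 97919.38, 97919.38],
    }
)

# Single brackets with a column name return a Series.
price_series = sample["price_usd"]

# Double brackets (a list of column names) return a DataFrame,
# even when the list contains only one name.
price_frame = sample[["price_usd"]]

print(type(price_series).__name__)
print(type(price_frame).__name__)
print(price_frame.shape)
```

Which form you want depends on what comes next: many Series methods, such as `value_counts`, are called on a single column, while functions that expect tabular input usually take a DataFrame.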


### Working with value_counts in a Series

Before using the data in a Series for other types of analysis, it's often helpful to know how often each value occurs. To find out, we use the `value_counts` method, which counts the occurrences of each unique value. Let's take a look at the number of properties associated with each department in the `mexico-city-real-estate-1` dataset.

In [10]:
df1 = pd.read_csv("data/mexico-city-real-estate-1.csv", usecols=["department"])
df1["department"].value_counts()

department
C    3
A    2
Name: count, dtype: int64
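
Raw counts can be turned into proportions with the `normalize` parameter of `value_counts`. This is a small sketch under assumptions: the `department` Series is typed in by hand to match the five rows shown above, and `shares` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for the department column of the dataset.
department = pd.Series(["A", "C", "A", "C", "C"], name="department")

# normalize=True returns the fraction of rows for each value
# instead of the raw count.
shares = department.value_counts(normalize=True)
print(shares)
```

Here department "C" accounts for three of the five rows and "A" for the remaining two.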

### Series and `Groupby`

Large Series often include data points that have some attribute in common, but which are nevertheless not grouped together in the dataset. Happily, pandas has a method that will bring these data points together into groups.

Let's take a look at the `mexico-city-real-estate-1` dataset. The set includes properties scattered across Colombia, so it might be useful to group properties from the same department together. To do this, we'll use the `groupby` method. The code looks like this:

In [14]:
dept_group = df1.groupby("department")

In [12]:
dept_group.first()

A
C


Now that we have all the properties grouped by department, we might want to see the properties in just one of the departments. We can use the `get_group` method to do that. If we just wanted to see the properties in "A", for example, the code would look like this:

In [15]:
dept_group = df1.groupby("department")
dept_group.get_group("A")

Unnamed: 0,department
0,A
2,A


We can also make groups based on more than one category by passing a list of column names to the `groupby` method. After resetting the `df1` DataFrame, here's what the code looks like if we want to group properties both by `department` and by `property_type`.

In [17]:
df1 = pd.read_csv("data/mexico-city-real-estate-2.csv")
dept_group2 = df1.groupby(["department", "property_type"])
dept_group2.first()

Unnamed: 0_level_0,Unnamed: 1_level_0,lat-lon,area_m2,price_usd
department,property_type,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,house,"4.69,-74.048",187.0,330899.98
C,house,"4.695,-74.082",82.0,121555.09


Finally, it's possible to use `groupby` to calculate aggregations. For example, to find the average property area in each department, we would select the `area_m2` column from the grouped data and call the `.mean()` method on it.
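
The original code for this step is not shown, so the following is a minimal sketch of one way to do it, assuming the five rows displayed earlier. `sample` is a hand-typed stand-in for the CSV and `mean_area` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for the dataset: the rows shown by head() earlier.
sample = pd.DataFrame(
    {
        "department": ["A", "C", "A", "C", "C"],
        "area_m2": [187.0, 82.0, 235.0, 195.0, float("nan")],
    }
)

# Group by department, select the area column, and average it.
# Missing values are skipped by default when computing the mean.
mean_area = sample.groupby("department")["area_m2"].mean()
print(mean_area)
```

With this sample, department "A" averages 211.0 square meters and "C" averages 138.5, because the missing area in "C" is left out of the calculation.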

### Pivot Tables

A pivot table allows us to aggregate and summarize a DataFrame across multiple variables. For example, let's suppose we wanted to calculate the mean of the `price_usd` column in the `mexico-city-real-estate-1` dataset for each value in the `property_type` column:

In [19]:
df = pd.read_csv("data/mexico-city-real-estate-1.csv")
mexico_city1_pivot = df.pivot_table(values="price_usd", index="property_type", aggfunc="mean")
mexico_city1_pivot

Unnamed: 0_level_0,price_usd
property_type,Unnamed: 1_level_1
house,173553.66


### Subsetting with Masks

Another way to create subsets from a larger dataset is through masking. A mask filters out the rows you're not interested in so that you can focus on the rows you are. For example, we might want to look at properties in Colombia that are bigger than 200 square meters. In order to create this subset, we'll need to use a mask.

First, we'll reset our `df1` DataFrame so that we can draw on all the data in its original form. Then we'll write a comparison on the `area_m2` column and assign the result to `mask`.

In [20]:
df1 = pd.read_csv("data/mexico-city-real-estate-1.csv")
mask = df1["area_m2"] > 200
mask.head()

0    False
1    False
2     True
3    False
4    False
Name: area_m2, dtype: bool

Notice that `mask` is a Series of Boolean values. Where a property's area is 200 square meters or less, our comparison evaluates to `False`; where it's bigger than 200, it evaluates to `True`. A missing area, as in row 4, also evaluates to `False`.

Once we have our mask, we can use it to select all the rows from `df1` that evaluate as `True`.

In [21]:
df1[mask].head()

Unnamed: 0,property_type,department,lat-lon,area_m2,price_usd
2,house,A,"4.535,-75.676",235.0,219474.47


### What's a pivot table?

A pivot table allows you to quickly aggregate and summarize a DataFrame using an aggregation function. The top-level `pd.pivot_table` function does the same job as the `pivot_table` method on a DataFrame. For example, to build a pivot table that summarizes the mean of the `price_usd` column for each of the unique categories in the `property_type` column in `df`:

In [33]:
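# Passing the string "mean" avoids the FutureWarning that recent pandas
# versions emit when a NumPy callable such as np.mean is used as aggfunc.
pivot1 = pd.pivot_table(df, values="price_usd", index="property_type", aggfunc="mean")
pivot1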

Unnamed: 0_level_0,price_usd
property_type,Unnamed: 1_level_1
house,173553.66
