## SUMMARY STATISTICS FOR CATEGORICAL VARIABLES

In [1]:
import numpy as np
import pandas as pd

### About DataSet
The dataset we’ll explore in this lesson is a sample of the NYC 2015 Tree Census. It contains information from a survey of trees in the city, collected by Parks Department employees and community volunteers.

**tree_id:** Unique identifier for each tree in the survey<br>
**trunk_diam:** Diameter of the tree, measured 54 inches above the ground<br>
**status:** Indicates whether the tree is alive, standing dead, or a stump<br>
**health:** Indicates the user’s perception of tree health<br>
**spc_common:** Common name for the species, e.g. “red maple”<br>
**neighborhood:** Name of the neighborhood the tree is located in


In [2]:
nyc_trees = pd.read_csv("nyc_tree_census.csv")
nyc_trees.head()

| Unnamed: 0 | tree_id | trunk_diam | status | health | spc_common | neighborhood |
|---|---|---|---|---|---|---|
| 0 | 199250 | 8 | Alive | Good | crab apple | Lincoln Square |
| 1 | 136891 | 17 | Alive | Good | honeylocust | East Harlem North |
| 2 | 200218 | 3 | Alive | Good | ginkgo | Chinatown |
| 3 | 53901 | 23 | Alive | Good | green ash | Bayside-Bayside Hills |
| 4 | 589218 | 21 | Alive | Good | pin oak | Glen Oaks-Floral Park-New Hyde Park |


### Exercise 1
Which of the columns are categorical variables? Write the names into a list named categorical_vars. Each name should be a separate string. Although id fields (for example, tree_id) can technically be considered categorical data, you do not need to include them in your list.

In [13]:
categorical_vars = nyc_trees.select_dtypes("object").columns.to_list()
categorical_vars

['status', 'health', 'spc_common', 'neighborhood']

### Nominal Categories
Depending on the data, some of the summary statistics we use for quantitative data can still be meaningful for categorical data. Let’s first consider a nominal categorical variable. A nominal categorical variable is a categorical variable with no intrinsic ordering to the categories.
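
For a nominal variable, the mode is the measure of central tendency that still makes sense. The snippet below is a minimal sketch on a small made-up Series (the `species` values are illustrative and are not read from the census file); it shows two ways to get the modal category.

```python
import pandas as pd

# Hypothetical toy data; not drawn from nyc_tree_census.csv.
species = pd.Series(["ginkgo", "pin oak", "ginkgo", "red maple", "ginkgo", "pin oak"])

counts = species.value_counts()   # frequency of each category, sorted descending
modal_species = counts.idxmax()   # label with the highest count
modes = species.mode()            # Series of all modes (more than one entry if tied)

print(counts)
print(modal_species)
```

`Series.mode()` returns every tied category, which can be safer than taking the first row of `value_counts()` when two categories share the top count.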

### Exercise 2
Using the nyc_trees data, find the count of trees in each neighborhood. Save the result as tree_counts and print the result.

In [16]:
tree_counts = nyc_trees["neighborhood"].value_counts()
tree_counts

Annadale-Huguenot-Prince's Bay-Eltingville    950
Great Kills                                   761
East New York                                 702
Bayside-Bayside Hills                         665
Rossville-Woodrow                             633
                                             ... 
48                                              1
69                                              1
39                                              1
65                                              1
BX33                                            1
Name: neighborhood, Length: 442, dtype: int64

### Exercise 3 
Using the nyc_trees data, find the neighborhood with the highest tree count. Save the name of the neighborhood as a variable called greenest_neighborhood and print the result.

In [26]:
greenest_neighborhood = nyc_trees["neighborhood"].value_counts().index[0]
greenest_neighborhood

"Annadale-Huguenot-Prince's Bay-Eltingville"

### Ordinal Categorical Variables
Ordinal categorical variables have ordered categories. For ordinal categorical variables, we can find the modal category just like in the previous exercise — but we can also calculate other summary statistics that are not possible for nominal categorical variables. For central tendency, this means we can also calculate a median.

### Exercise 4
Using the NYC trees dataset, find the unique values in the column health. Save the unique categories to a variable named tree_health_statuses and print the result.

In [32]:
tree_health_statuses = nyc_trees["health"].unique()
tree_health_statuses

array(['Good', 'Poor', 'Fair', nan], dtype=object)

### Exercise 5
Create a list named health_categories which lists the categories from worst to best. You should exclude NaN values from your list.

Using the health_categories list, convert health in the original dataset to an ordered categorical variable type ('category').

In [38]:
health_categories = ["Poor", "Fair", "Good"]

In [41]:
nyc_trees["health"].dtypes

dtype('O')

In [44]:
nyc_trees["health"] = pd.Categorical(nyc_trees["health"], categories = health_categories, ordered = True)

In [46]:
nyc_trees["health"].dtypes

CategoricalDtype(categories=['Poor', 'Fair', 'Good'], ordered=True)
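
To illustrate what the `ordered=True` flag adds, here is a small sketch on made-up values (the `health` Series below is hypothetical, not the census column). Once the categories are ordered, pandas can compare them and find the lowest and highest category present.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
levels = ["Poor", "Fair", "Good"]   # worst to best
health = pd.Series(pd.Categorical(["Good", "Poor", "Fair", None, "Good"],
                                  categories=levels, ordered=True))

worst = health.min()                # lowest category present; missing values are skipped
best = health.max()                 # highest category present
below_good = health < "Good"        # element-wise comparison; missing entries give False

print(worst, best)
print(below_good.tolist())
```

Without `ordered=True`, the `<` comparison and `min()`/`max()` would raise an error, because pandas has no ordering to rely on.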

### Exercise 6
Using cat.codes, calculate the value that corresponds to the median value of health. Save it as a variable named median_health_status and print the result.

In [54]:
median_index = np.median(nyc_trees["health"].cat.codes)
median_index

2.0

In [59]:
median_health_status = health_categories[int(median_index)]
median_health_status

'Good'
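
One detail worth checking when working with .cat.codes: pandas encodes missing values as -1, so they are included in any statistic computed on the raw codes. The sketch below uses a deliberately extreme made-up Series (half of the values missing) to show the effect and one possible way to exclude the -1 entries first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; half the values are missing to make the effect visible.
levels = ["Poor", "Fair", "Good"]
health = pd.Series(pd.Categorical(["Poor", "Fair", "Good", None, None, None],
                                  categories=levels, ordered=True))

codes = health.cat.codes                 # missing values are encoded as -1
valid_codes = codes[codes != -1]         # keep only real categories

median_all = np.median(codes)            # the -1 entries pull the median down
median_valid = np.median(valid_codes)    # median over observed categories only
median_label = levels[int(median_valid)]

print(median_all, median_valid, median_label)
```

When only a small share of values is missing and one category dominates, the median may come out the same either way, but filtering out -1 keeps the result from depending on that.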

### Important Note
When we use .cat.codes to translate these categories into integers, those integers have equal spacing. While translating categories to numbers is often necessary to store and use the order of the categories (for calculating a statistic like the median, which only relies on ordering, not spacing), we should not use those numbers to calculate statistics — such as the mean — for which the distance between values matters.

In practice, researchers sometimes (albeit incorrectly) report means for ordinal categories. For example, a researcher might want to analyze survey responses to the question "Rate your happiness on a scale from 1 to 5 where 1 means 'very unhappy' and 5 means 'very happy'".

If that researcher calculates a 'mean happiness score', they are assuming that the difference in happiness between a rating of 1 and 2 is the same as the difference between a rating of 3 and 4. In practice, this assumption is likely not true and should be acknowledged when reporting a mean of an ordinal categorical variable.

In [61]:
nyc_trees = pd.read_csv("nyc_tree_census2.csv")
nyc_trees.head()

| Unnamed: 0 | tree_id | trunk_diam | status | health | spc_common | neighborhood | tree_diam_category |
|---|---|---|---|---|---|---|---|
| 0 | 199250 | 8 | Alive | Good | crab apple | Lincoln Square | Medium-Large (10-18in) |
| 1 | 136891 | 17 | Alive | Good | honeylocust | East Harlem North | Large (18-24in) |
| 2 | 200218 | 3 | Alive | Good | ginkgo | Chinatown | Medium (3-10in) |
| 3 | 53901 | 23 | Alive | Good | green ash | Bayside-Bayside Hills | Very large (>24in) |
| 4 | 589218 | 21 | Alive | Good | pin oak | Glen Oaks-Floral Park-New Hyde Park | Very large (>24in) |


### Exercise 7
This dataset contains two variables related to trunk size. The first variable, trunk_diam, contains the diameter of the trunk (in inches) for each tree. The variable tree_diam_category, on the other hand, categorizes each tree based on the size of the trunk. The categories are: 'Small (0-3in)', 'Medium (3-10in)', 'Medium-Large (10-18in)', 'Large (18-24in)', 'Very large (>24in)'. You’ll notice that these categories are not evenly spaced with respect to diameter.

Calculate the mean of trunk_diam (the quantitative variable), save it as mean_diam, and print the result.

In [63]:
mean_diam = np.mean(nyc_trees["trunk_diam"])
mean_diam

11.27048

### Exercise 8
Save tree_diam_category as an ordered categorical variable so that you can use cat.codes. Then calculate the mean of the tree_diam_category codes, save it in a variable named mean_diam_cat, and print it out.

Which category does this correspond to (remember that cat.codes translates the categories to numbers between 0 and 4)? Note how this is different from the mean you calculated in the last exercise. While the mean diameter is about 11.27 inches (which would be categorized as “Medium-Large”), the mean category index is about 1.97, which is between 'Medium (3-10in)' and 'Medium-Large (10-18in)'.

In [68]:
nyc_trees["tree_diam_category"].unique()

array(['Medium-Large (10-18in)', 'Large (18-24in)', 'Medium (3-10in)',
       'Very large (>24in)', nan, 'Small (0-3in)'], dtype=object)

In [80]:
size_labels_ordered = ['Small (0-3in)', 'Medium (3-10in)', 'Medium-Large (10-18in)', 'Large (18-24in)', 'Very large (>24in)']
size_labels_ordered

['Small (0-3in)',
 'Medium (3-10in)',
 'Medium-Large (10-18in)',
 'Large (18-24in)',
 'Very large (>24in)']

In [81]:
nyc_trees["tree_diam_category"] = pd.Categorical(nyc_trees["tree_diam_category"], categories = size_labels_ordered, ordered = True)
nyc_trees["tree_diam_category"]

0        Medium-Large (10-18in)
1               Large (18-24in)
2               Medium (3-10in)
3            Very large (>24in)
4            Very large (>24in)
                  ...          
49995    Medium-Large (10-18in)
49996           Large (18-24in)
49997                       NaN
49998    Medium-Large (10-18in)
49999             Small (0-3in)
Name: tree_diam_category, Length: 50000, dtype: category
Categories (5, object): ['Small (0-3in)' < 'Medium (3-10in)' < 'Medium-Large (10-18in)' < 'Large (18-24in)' < 'Very large (>24in)']

In [82]:
mean_diam_cat = np.mean(nyc_trees["tree_diam_category"].cat.codes)
mean_diam_cat

1.97282
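
The gap between the mean of a measurement and the mean of its category codes can be reproduced on a tiny made-up sample. In this sketch the diameters are invented, and the bin edges (treated as right-inclusive by `pd.cut`) are an assumption based on the category labels rather than the rule actually used to build tree_diam_category.

```python
import numpy as np
import pandas as pd

labels = ["Small (0-3in)", "Medium (3-10in)", "Medium-Large (10-18in)",
          "Large (18-24in)", "Very large (>24in)"]
bins = [0, 3, 10, 18, 24, np.inf]                   # assumed, unevenly spaced edges

diam = pd.Series([1, 2, 2, 3, 50])                  # made-up diameters in inches
category = pd.cut(diam, bins=bins, labels=labels)   # ordered categorical

mean_diam = diam.mean()                             # mean of the real measurements
mean_code = category.cat.codes.mean()               # mean of the integer codes
label_of_mean = pd.cut([mean_diam], bins=bins, labels=labels)[0]

print(mean_diam, label_of_mean)
print(mean_code)
```

Here the mean diameter lands in 'Medium-Large (10-18in)', while the mean code sits below 1, closer to 'Small' and 'Medium'. The codes treat every step between categories as the same size, which the underlying inches do not support.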

### Ordinal Categories: Spread
In the last exercise, we learned that the mean is not interpretable for ordinal categorical variables because the mean relies on the assumption of equal spacing between categories.

Many other statistics we might normally use for numerical data rely on the mean. Because of this, these statistics aren’t appropriate for ordinal data. Remember that the standard deviation and variance both depend on the mean: without a meaningful mean, we can’t have a reliable standard deviation or variance either!

Instead, we can rely on other summary statistics, like the proportion of the data within a range, or percentiles/quantiles.
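
As a rough sketch of both ideas, the snippet below computes quartiles and a "proportion within a range" for a made-up ordered variable. The labels and values are hypothetical, and using `interpolation="lower"` is one possible choice (it returns an observed code instead of a fractional value between two categories).

```python
import pandas as pd

levels = ["Small", "Medium", "Large"]               # hypothetical ordered labels
sizes = pd.Series(pd.Categorical(
    ["Small", "Medium", "Medium", "Large", "Large", "Large", None],
    categories=levels, ordered=True))

codes = sizes.cat.codes
codes = codes[codes != -1]                          # drop the -1 used for missing values

# Quartiles on the codes, mapped back to labels
q25 = codes.quantile(0.25, interpolation="lower")
q75 = codes.quantile(0.75, interpolation="lower")
p25_label, p75_label = levels[int(q25)], levels[int(q75)]

# Proportion of non-missing values that are "Medium" or larger
share_medium_or_larger = (codes >= levels.index("Medium")).mean()

print(p25_label, p75_label, share_medium_or_larger)
```

The spread is then reported as a range of labels (from the 25th to the 75th percentile category) rather than as a numeric difference between codes.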

### Exercise 9
Calculate the 25th percentile for tree_diam_category. Use the ordered list, size_labels_ordered, to find the corresponding label. Save your result (the label, not the index) to a variable named p25_tree_diam_category and print it to the console.

In [88]:
np.percentile(nyc_trees["tree_diam_category"].cat.codes, 25)

1.0

In [90]:
p25_tree_diam_category = size_labels_ordered[int(np.percentile(nyc_trees["tree_diam_category"].cat.codes, 25))]
p25_tree_diam_category

'Medium (3-10in)'

### Exercise 10
Calculate the 75th percentile of tree_diam_category. Use the ordered list, size_labels_ordered, to find the corresponding label. Save your result (the label, not the index) to a variable named p75_tree_diam_category and print it to the console.

Together with the 25th percentile, we can use this value to determine the Interquartile Range (IQR) for tree_diam_category.

In [93]:
np.percentile(nyc_trees["tree_diam_category"].cat.codes, 75)

3.0

In [94]:
p75_tree_diam_category = size_labels_ordered[int(np.percentile(nyc_trees["tree_diam_category"].cat.codes, 75))]
p75_tree_diam_category

'Large (18-24in)'

### Table of Proportions
You’ve already seen that we can use the .value_counts() method to get a table of frequencies for a categorical variable. A table of frequencies is often the first approach a data scientist might use to summarize a categorical variable; however, it is sometimes useful to instead look at the proportion of values in each category.
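
A table of proportions is the table of frequencies divided by the total count. The sketch below shows this on a small made-up Series (the `status` values are illustrative), once with the `normalize=True` argument and once by hand.

```python
import pandas as pd

# Hypothetical toy data; not read from the census file.
status = pd.Series(["Alive", "Alive", "Stump", "Alive", "Dead"])

counts = status.value_counts()                      # table of frequencies
proportions = status.value_counts(normalize=True)   # table of proportions
by_hand = counts / counts.sum()                     # same numbers computed manually

print(counts)
print(proportions)
```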

In [97]:
nyc_trees = pd.read_csv("./nyc_tree_census.csv")
nyc_trees.head()

| Unnamed: 0 | tree_id | trunk_diam | status | health | spc_common | neighborhood |
|---|---|---|---|---|---|---|
| 0 | 199250 | 8 | Alive | Good | crab apple | Lincoln Square |
| 1 | 136891 | 17 | Alive | Good | honeylocust | East Harlem North |
| 2 | 200218 | 3 | Alive | Good | ginkgo | Chinatown |
| 3 | 53901 | 23 | Alive | Good | green ash | Bayside-Bayside Hills |
| 4 | 589218 | 21 | Alive | Good | pin oak | Glen Oaks-Floral Park-New Hyde Park |


### Exercise 11

Calculate a table of proportions for the status column. Save this table of proportions as tree_status_proportions and print the result.

In [101]:
tree_status_proportions = nyc_trees["status"].value_counts(normalize = True)
tree_status_proportions

Alive    0.9539
Stump    0.0267
Dead     0.0194
Name: status, dtype: float64

### Table of Proportions: Missing Data
By default, .value_counts() excludes missing values (NaN). With normalize=True, the proportions are therefore computed over the non-missing values only. To count missing values as their own category and include them in the denominator, pass dropna=False.
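
A minimal sketch of how missing values change the denominator, using a made-up Series with one NaN (the values are illustrative, not the census data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three observed values and one missing value.
health = pd.Series(["Good", "Good", "Fair", np.nan])

# Default: NaN is dropped, so the denominator is 3
non_missing = health.value_counts(normalize=True)

# dropna=False: NaN becomes its own row, so the denominator is 4
with_missing = health.value_counts(normalize=True, dropna=False)

print(non_missing)
print(with_missing)
```

Both tables sum to 1, but the share of "Good" drops from two thirds to one half once the missing row is counted.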

### Exercise 12
Using .value_counts(), calculate the proportions for each category in the health variable. The denominator for your proportions should be the number of non-missing values in the health column. Save the result as health_proportions and print the result.

In [103]:
health_proportions = nyc_trees["health"].value_counts(normalize=True)
health_proportions

Good    0.810986
Fair    0.146871
Poor    0.042143
Name: health, dtype: float64

### Exercise 13
Now, still using .value_counts(), add a parameter to include missing values in the denominator when calculating proportions for the health variable. Save the result as health_proportions_2. Why are the two sets of results different? Can you think of scenarios where one might be more appropriate to report than the other?

In [105]:
health_proportions_2 = nyc_trees["health"].value_counts(normalize=True, dropna=False)
health_proportions_2

Good    0.7736
Fair    0.1401
NaN     0.0461
Poor    0.0402
Name: health, dtype: float64

### Binary Categorical Variables
Binary categorical variables have only two categories. In Python, these variables are often coded as 0/1 or True/False. This makes it easy to calculate the frequency and proportion of these variables in a dataset: the sum of the values gives the frequency, and the mean gives the proportion.
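
Because True counts as 1 and False as 0, summing a boolean column gives a frequency and averaging it gives a proportion. The sketch below shows this on made-up diameters; the variable names and the threshold of 30 are only for illustration.

```python
import pandas as pd

# Hypothetical toy data: trunk diameters in inches.
diam = pd.Series([2, 35, 8, 41, 12])

is_large = diam > 30                # boolean indicator column
large_frequency = is_large.sum()    # number of True values
large_proportion = is_large.mean()  # share of True values

print(large_frequency, large_proportion)
```

Note that a comparison against a missing value evaluates to False, so rows with missing data are counted as not meeting the condition.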

### Exercise 14
Find the frequency and proportion of trees that were recorded as Alive. You can do this by transforming the status variable to an indicator for whether a tree is alive (indicated by status == 'Alive') or not. Save the results to variables named living_frequency and living_proportion and print them to the console.

In [119]:
nyc_trees["status"] == "Alive"

0         True
1         True
2         True
3         True
4         True
         ...  
49995     True
49996     True
49997     True
49998     True
49999    False
Name: status, Length: 50000, dtype: bool

In [116]:
living_frequency = np.sum(nyc_trees["status"] == "Alive")
living_frequency

47695

In [118]:
living_proportion = (nyc_trees["status"] == "Alive").mean()
living_proportion

0.9539

### Exercise 15
Find the frequency and proportion of trees with trunk_diam > 30. Save the results to variables named giant_frequency and giant_proportion and print them to the console.

In [124]:
nyc_trees["trunk_diam"] > 30

0        False
1        False
2        False
3        False
4        False
         ...  
49995    False
49996    False
49997    False
49998    False
49999    False
Name: trunk_diam, Length: 50000, dtype: bool

In [126]:
giant_frequency = np.sum(nyc_trees["trunk_diam"] > 30)
giant_frequency

1788

In [128]:
giant_proportion = (nyc_trees["trunk_diam"] > 30).mean()
giant_proportion

0.03576

### Review

In this lesson you’ve learned the steps you can take to summarize and interpret summaries of nominal categorical and ordinal categorical variables.

- For nominal categorical variables, there is no ordering to the categories. Because of this, we’re limited to using the mode to describe central tendency and there is no way to summarize the spread.

- For ordinal categorical variables, there is an implied ordering to the categories. In Python, we can use pd.Categorical() to transform a variable to a categorical type. The Categorical type allows us to access a numeric value for each category by using .cat.codes. From there, we may perform operations on this variable as if it were a regular, numeric variable.

- However, when calculating statistics for an ordinal categorical variable we should be mindful that some numeric statistics rely on the assumption of equal spacing between categories.

- For ordinal categorical variables, median and mode can be used to summarize the central tendency, and the IQR (or any difference between percentiles) can be used to summarize the spread.

- Certain summary statistics (e.g., frequencies and proportions) can be used for all categorical variables. You can create true/false columns and use np.sum() and np.mean() to quickly find how many rows, and what proportion of your data, meet certain criteria.