# Preliminaries

In [None]:
# Load pandas for dataframes:
import pandas as pd

# For plotting:
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib.ticker import FuncFormatter
import seaborn as sns

# Load SciPy for statistical testing:
import scipy.stats


---

# Grouping: World Happiness Report 2016

In [None]:
whr_df_16 = pd.read_csv('WHR_2016.csv')

In [None]:
whr_df_16.shape

In [None]:
whr_df_16[:3]

## `size()` counts the number of observations in each group

In [None]:
whr_df_16.groupby('Region').size()

## `sum()`

In [None]:
whr_df_16.groupby('Region').sum()

The above runs, but sums of quantities such as ranks or scores are rarely meaningful. Other summaries are more appropriate.
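
As a minimal sketch of why a blanket `sum()` is rarely the right summary, consider a made-up four-row table. The column names mirror the WHR file, but the values are invented for illustration. Passing `numeric_only=True` keeps text columns out of the result, and a per-region mean of the score is easier to interpret than a sum of ranks.

```python
import pandas as pd

# Hypothetical toy data; values are invented, not taken from WHR_2016.csv.
toy_df = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Region': ['North', 'North', 'South', 'South'],
    'Happiness Rank': [1, 2, 3, 4],
    'Happiness Score': [7.5, 7.0, 5.0, 4.5],
})

# Restrict the sum to numeric columns so text columns are left out.
region_sums = toy_df.groupby('Region').sum(numeric_only=True)

# A sum of ranks has no natural interpretation; a mean score does.
region_means = toy_df.groupby('Region')['Happiness Score'].mean()

print(region_sums)
print(region_means)
```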

## Selecting a column and using more functions

### `mean()`

In [None]:
whr_df_16.groupby('Region')['Happiness Score'].mean()

### `median()`

In [None]:
whr_df_16.groupby('Region')['Happiness Score'].median()

## `agg()` when applying multiple functions to grouped data

In [None]:
whr_df_16.groupby('Region')['Happiness Score'].agg(['size', 'mean', 'median', 'min', 'max'])
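
When the default column names are not descriptive enough, `agg()` also accepts keyword arguments that name each output column. The sketch below uses an invented table and hypothetical output names (`n`, `average`, `lowest`, `highest`).

```python
import pandas as pd

# Hypothetical toy data standing in for the WHR table.
toy_df = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South', 'South'],
    'Happiness Score': [7.5, 7.0, 5.0, 4.5, 6.1],
})

# Each keyword becomes a column name; each value names the function to apply.
summary = toy_df.groupby('Region')['Happiness Score'].agg(
    n='count',
    average='mean',
    lowest='min',
    highest='max',
)
print(summary)
```

Note that `'count'` skips missing values, whereas `'size'` counts every row in the group.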

## Visualizing a distribution using `hist()`

In [None]:
whr_df_16['Economy (GDP per Capita)']

In [None]:
whr_df_16['Economy (GDP per Capita)'].hist()

## Quiz

* Suppose you want to split the countries into four groups by GDP per capita: "group_1" for the top quarter, "group_2" for the second quarter, "group_3" for the third quarter, and "group_4" for the bottom quarter.
* In `whr_df_16`, create a column called "GDP_four_groups" that indicates which group each country belongs to.
* Print (1) the average and (2) the maximum happiness score per GDP group.

*Hints*:
* You might find the `describe()` method useful.
* Yesterday, we saw that the `.loc[]` operator is useful for setting column values.

In [None]:
# TODO

---

# Grouping: Baltimore crime

Let's analyze a richer data set.

In [None]:
bmc_df = pd.read_csv('BPD_Part_1_Victim_Based_Crime_Data.csv')

In [None]:
bmc_df.info()

In [None]:
bmc_df[:3]

---

## Inside and Outside (Hypothesis 1)

* H1: The type of weapon used varies with whether a crime occurs inside or outside.

In [None]:
bmc_df['Inside/Outside'].unique()

In [None]:
bmc_df['Inside/Outside'] = bmc_df['Inside/Outside'].str.replace("Inside", "I")
bmc_df['Inside/Outside'] = bmc_df['Inside/Outside'].str.replace("Outside", "O")

In [None]:
bmc_df['Inside/Outside'].unique()
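
An alternative way to harmonize mixed labels is `Series.replace()` with a mapping, which matches whole values rather than substrings. This is a sketch on invented values, assuming the column mixes short and long labels with some missing entries.

```python
import pandas as pd

# Hypothetical values resembling a column with inconsistent labels.
location = pd.Series(['I', 'Outside', 'O', 'Inside', None])

# A dict passed to Series.replace() maps whole values; missing entries stay missing.
location_clean = location.replace({'Inside': 'I', 'Outside': 'O'})
print(location_clean.unique())
```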

In [None]:
bmc_df['Weapon'].unique()

- <b>Group the DataFrame by the two target columns</b>

In [None]:
bmc_df.groupby(['Weapon', 'Inside/Outside']).size()
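
Raw counts are hard to compare when one location has far more incidents than the other. One way to put them on the same scale is a normalized cross-tabulation. The sketch below uses invented rows and hypothetical weapon labels; `normalize='columns'` makes each column sum to 1.

```python
import pandas as pd

# Hypothetical toy data; weapon labels and counts are invented.
toy_df = pd.DataFrame({
    'Weapon': ['FIREARM', 'FIREARM', 'KNIFE', 'KNIFE',
               'HANDS', 'HANDS', 'HANDS', 'FIREARM'],
    'Inside/Outside': ['O', 'O', 'I', 'O', 'I', 'I', 'O', 'I'],
})

# Share of each weapon type within inside ('I') and outside ('O') incidents.
shares = pd.crosstab(toy_df['Weapon'], toy_df['Inside/Outside'],
                     normalize='columns')
print(shares.round(2))
```

If the weapon mix were unrelated to location, the two columns would look roughly alike.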

- <b>`groupby()` and `sort_values()` vs. `value_counts()`</b>

In [None]:
# bmc_df.groupby('Description').size().sort_values(ascending=False)
bmc_df['Description'].value_counts()
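
The two forms give the same counts. A small check on invented data (the crime labels here are only placeholders):

```python
import pandas as pd

# Hypothetical toy data with a single 'Description' column.
toy_df = pd.DataFrame({
    'Description': ['LARCENY', 'BURGLARY', 'LARCENY',
                    'HOMICIDE', 'LARCENY', 'BURGLARY'],
})

via_groupby = toy_df.groupby('Description').size().sort_values(ascending=False)
via_value_counts = toy_df['Description'].value_counts()

# Compare as dicts so that the order of tied categories does not matter.
print(via_groupby.to_dict() == via_value_counts.to_dict())
```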

In [None]:
bmc_df['District'].value_counts()

## Quiz 

* Using `value_counts()`, print the ten neighborhoods that have experienced the most homicides.


In [None]:
# TODO: solution

## More analysis


In [None]:
homicides_df = bmc_df[bmc_df['Description'] == 'HOMICIDE']
homicides_df

Here's a count of homicides per neighborhood, including the district to which each neighborhood belongs:

In [None]:
homicides_per_neighborhood_and_type_df = homicides_df.groupby(['District', 'Neighborhood', 'Description'])['Total Incidents'].sum().reset_index()
homicides_per_neighborhood_and_type_df


Let's figure out which neighborhood in each district has seen the most homicides:

In [None]:
# Approach 1
homicides_per_neighborhood_and_type_df.sort_values(by='Total Incidents', ascending=False).drop_duplicates('District')

In [None]:
# Approach 2
homicides_per_neighborhood_and_type_df.loc[homicides_per_neighborhood_and_type_df.groupby(["District"])["Total Incidents"].idxmax()]      
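
To see that both approaches select the same rows, here is a sketch on a tiny invented table (district and neighborhood names are hypothetical). When two neighborhoods in a district tie for the maximum, the approaches may pick different rows.

```python
import pandas as pd

# Hypothetical toy data: two districts with two neighborhoods each.
toy_df = pd.DataFrame({
    'District': ['EAST', 'EAST', 'WEST', 'WEST'],
    'Neighborhood': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'Total Incidents': [5, 9, 4, 2],
})

# Approach 1: sort descending, then keep the first row seen per district.
top_by_sort = (toy_df.sort_values(by='Total Incidents', ascending=False)
                     .drop_duplicates('District')
                     .sort_values('District'))

# Approach 2: find the row label of each district's maximum, then select it.
top_by_idxmax = toy_df.loc[
    toy_df.groupby('District')['Total Incidents'].idxmax()
]

print(top_by_sort[['District', 'Neighborhood']])
print(top_by_idxmax[['District', 'Neighborhood']])
```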

---

## Weather and Crimes (Hypothesis 2)

* Although archival data like `bmc_df` cannot establish causality, it is rich enough to check whether prior findings are at least consistent with what we observe.
* For instance, we can check whether the claim that warmer weather makes us more violent is consistent with the patterns in `bmc_df`.
* Let's check out this article first: [A hotter world will be a more violent world.](https://news.trust.org/item/20210811162128-r7vkq)
* Do you think you can (at least indirectly) test the claim with the current data? 
* If so, what's your approach? 

In [None]:
bmc_df[:3]

In [None]:
bmc_df['CrimeDate'].iloc[0]

In [None]:
bmc_df['CrimeDate_dt'] = pd.to_datetime(bmc_df.CrimeDate, format="%m/%d/%Y")
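
Passing an explicit `format` makes the parsing unambiguous, but by default it raises an error on any value that does not match. The sketch below, on invented strings, shows how `errors='coerce'` turns unparseable entries into `NaT` instead. Whether the real file contains such entries is not known here.

```python
import pandas as pd

# Hypothetical date strings, including one malformed entry.
raw_dates = pd.Series(['01/15/2016', '12/31/2015', 'not a date'])

# With errors='coerce', values that do not match the format become NaT.
parsed = pd.to_datetime(raw_dates, format='%m/%d/%Y', errors='coerce')

print(parsed)
print(parsed.isna().sum(), 'value(s) could not be parsed')
```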

In [None]:
bmc_df['CrimeDate_dt'].iloc[0].month

In [None]:
# The .dt accessor extracts date parts for the whole column at once:
bmc_df['year'] = bmc_df['CrimeDate_dt'].dt.year
bmc_df['month'] = bmc_df['CrimeDate_dt'].dt.month

In [None]:
bmc_df['month']

In [None]:
# bmc_df.groupby(['year', 'month']).size().sort_values(ascending=False)[:10]
bmc_df.value_counts(subset=['month', 'year'])

In [None]:
# bmc_df.groupby('year').size().sort_values(ascending=False)
bmc_df['year'].value_counts()

In [None]:
# bmc_df.groupby('month').size().sort_values(ascending=False)
bmc_df['month'].value_counts()

**Suppose you hypothesize that some types of violent crimes are more sensitive to weather:**
>- 'SHOOTING'
>- 'AGG. ASSAULT'
>- 'COMMON ASSAULT'
>- 'HOMICIDE'
>- 'ROBBERY - STREET'
>- 'RAPE'
    
- What would you do to gauge whether the hypothesis is worth exploring?

In [None]:
bmc_df['Description'].unique()

In [None]:
dir_violence = ['SHOOTING', 'AGG. ASSAULT', 'COMMON ASSAULT', 'HOMICIDE', 'ROBBERY - STREET', 'RAPE']

bmc_df[bmc_df['Description'].isin(dir_violence)]['month'].value_counts()

In [None]:
bmc_df[~bmc_df['Description'].isin(dir_violence)]['month'].value_counts()
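
Because the two groups differ in size, raw monthly counts are hard to compare directly. One option is to compare each month's share of the group total, side by side. This is a sketch on invented rows with a shortened, hypothetical list of violent categories.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the notebook.
toy_df = pd.DataFrame({
    'Description': ['SHOOTING', 'LARCENY', 'SHOOTING',
                    'LARCENY', 'HOMICIDE', 'LARCENY'],
    'month': [7, 1, 7, 7, 1, 1],
})
violent = ['SHOOTING', 'HOMICIDE']
is_violent = toy_df['Description'].isin(violent)

# normalize=True converts counts to shares that sum to 1 within each group.
violent_share = (toy_df.loc[is_violent, 'month']
                 .value_counts(normalize=True).sort_index())
other_share = (toy_df.loc[~is_violent, 'month']
               .value_counts(normalize=True).sort_index())

comparison = pd.DataFrame({'violent': violent_share, 'other': other_share})
print(comparison)
```

If violent crime were more sensitive to weather, its share would be expected to peak more sharply in the warm months than the share of other crimes.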

---

# Visualization: Baltimore crime

## Built-in functions in Pandas

In [None]:
gb1 = bmc_df.groupby(['year']).size()

In [None]:
gb1

In [None]:
gb1.plot(kind='bar')

In [None]:
gb2 = bmc_df.groupby(['year', "District"]).size()
gb2.unstack().plot(kind='bar', stacked=True)
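
The `unstack()` step is what makes the stacked bars possible: it pivots the inner index level into columns, giving one row per year and one column per district. A sketch on invented rows, without the plotting call:

```python
import pandas as pd

# Hypothetical toy data with the same two grouping columns.
toy_df = pd.DataFrame({
    'year': [2015, 2015, 2015, 2016, 2016],
    'District': ['EAST', 'WEST', 'EAST', 'EAST', 'WEST'],
})

# A Series indexed by (year, District) pairs.
counts = toy_df.groupby(['year', 'District']).size()

# unstack() moves the last index level (District) into the columns.
wide = counts.unstack()
print(wide)
```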

## Seaborn

In [None]:
gb1_for_sns = pd.DataFrame(gb1).reset_index().rename(columns={0:"count"})

In [None]:
gb1_for_sns
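
A shorter route to the same tidy table is `reset_index(name=...)`, which names the count column in one step instead of renaming column `0` afterwards. A sketch on invented data:

```python
import pandas as pd

# Hypothetical toy data: three incidents across two years.
toy_df = pd.DataFrame({'year': [2015, 2015, 2016]})

# name='count' labels the values column while the index becomes a column.
counts_df = toy_df.groupby('year').size().reset_index(name='count')
print(counts_df)
```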

In [None]:
sns.barplot(data=gb1_for_sns, x='year', y='count')

Drop 2017, the outlier year:

In [None]:
bmc_df_new = bmc_df[bmc_df.year!=2017]

In [None]:
bmc_df_new.year.unique()

### Yearly Crime Occurrences By District

In [None]:
gb3 = bmc_df_new.groupby(['year', "District"]).size().reset_index()
gb3.rename(columns={0:'count'}, inplace=True)

In [None]:
gb3.head()

In [None]:
sns.set(rc={'figure.figsize':(12, 6)}) 

ax = sns.lineplot(data=gb3, x='year', y='count', hue='District')

# FuncFormatter (imported above) applies our function to each tick label:
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _: int(x)))
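
A formatter built with `FuncFormatter` can be tried out on its own, since calling it simply calls the wrapped function with a tick value and a position. The sketch below returns a string, which is the type Matplotlib documents for tick labels; the name `year_formatter` is only illustrative.

```python
from matplotlib.ticker import FuncFormatter

# The wrapped function receives the tick value and its position.
year_formatter = FuncFormatter(lambda x, _: f'{int(x)}')

# Calling the formatter directly shows the label a tick at 2015.0 would get.
print(year_formatter(2015.0, 0))
```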

In [None]:
ax = sns.lineplot(data=gb3, x='year', y='count', hue='District')
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _: int(x)))

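# Two additions: move the legend outside the plot area and switch to the "white" theme.
# Note: set_theme() changes global defaults, so it mainly affects plots drawn afterwards.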
sns.move_legend(ax, "upper right", bbox_to_anchor=(1.25, 1))
sns.set_theme(style="white")

### What about monthly patterns?

In [None]:
gb4 = bmc_df_new.groupby(['month', "District"]).size().reset_index()
gb4.rename(columns={0:'count'}, inplace=True)
gb4

In [None]:
sns.set(rc={'figure.figsize':(12, 8)})
sns.lineplot(data=gb4, x='month', y='count', hue='District')

---

# Visualization: McDonald's

In [None]:
md_menu_df = pd.read_csv('menu_mcdonalds.csv')

In [None]:
md_menu_df.columns

In [None]:
md_menu_df.head()

In [None]:
md_menu_df.tail()

## Scatter plot

In [None]:
md_menu_df['Category'].unique()

### Using the built-in `plot()` in Pandas

In [None]:
md_menu_df.plot(kind='scatter', x='Sodium (% Daily Value)', y="Cholesterol (% Daily Value)")

### Seaborn's `regplot()`

In [None]:
sns.regplot(data=md_menu_df, x='Sodium (% Daily Value)', y="Cholesterol (% Daily Value)")

### Using Seaborn's `FacetGrid`

In [None]:
g = sns.FacetGrid(md_menu_df, col='Category', col_wrap=3, height=3.5, sharex=False, sharey=False)
g.map(sns.regplot, "Cholesterol (% Daily Value)", "Sodium (% Daily Value)", color='r')

---

# Statistical Testing: Starbucks

In [None]:
sb_df = pd.read_csv('menu_starbucks.csv')

## Histogram by Pandas

In [None]:
sb_df.hist(figsize=(15,9), bins=30)

## One-sample t-test of a population mean, with SciPy


This is the sample mean for calorie count:

In [None]:
sb_df['Calories'].mean()

What about the true population mean?

Is it equal to zero?

In this test, the Null hypothesis is that the population mean is equal to zero.

In [None]:
scipy.stats.ttest_1samp(sb_df['Calories'], popmean=0)

The t-statistic is large in magnitude and the p-value is very small, so we reject the Null hypothesis that the population mean is zero.
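
For reference, the one-sample t-statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors: t = (mean − popmean) / (s / √n). The sketch below recomputes it by hand on five invented calorie values and compares it with SciPy's result.

```python
import math
import statistics

import scipy.stats

# Hypothetical calorie counts, invented for illustration.
calories = [150, 200, 250, 180, 220]
popmean = 0

n = len(calories)
sample_mean = statistics.mean(calories)
sample_sd = statistics.stdev(calories)  # sample SD (n - 1 denominator)

# t = (sample mean - hypothesized mean) / standard error of the mean
t_by_hand = (sample_mean - popmean) / (sample_sd / math.sqrt(n))

result = scipy.stats.ttest_1samp(calories, popmean=popmean)
print(t_by_hand, result.statistic, result.pvalue)
```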

Is the population mean equal to 200? In this test, the Null hypothesis is that the population mean is 200.

In [None]:
scipy.stats.ttest_1samp(sb_df['Calories'], popmean=200)

This time, the t-statistic is small in magnitude and the p-value is high, so we do not reject the Null. The data are consistent with a population mean of 200, although failing to reject does not prove it.

## Two-sample t-test for difference across populations, with SciPy

In [None]:
sb_df.Beverage_category.value_counts()

Let's compare the average (mean) caffeine content of the two largest categories of drinks: Classic Espresso drinks versus Tazo Tea drinks.

In [None]:
cls_esp_caff = sb_df[sb_df['Beverage_category'] == 'Classic Espresso Drinks']['Caffeine (mg)'].astype(int)
cls_esp_caff[0:10]

In [None]:
# Running this line will fail:
tz_tea_caff = sb_df[sb_df['Beverage_category'] == 'Tazo® Tea Drinks']['Caffeine (mg)'].astype(int)


In [None]:
# Investigate:
sb_df[sb_df['Beverage_category'] == 'Tazo® Tea Drinks']['Caffeine (mg)']

In [None]:
# Exclude the non-integer entries before forcing a conversion:
tz_tea_caff = sb_df[
    (sb_df['Beverage_category'] == 'Tazo® Tea Drinks')
    & (sb_df['Caffeine (mg)'] != 'varies') & (sb_df['Caffeine (mg)'] != 'Varies')
]['Caffeine (mg)'].astype(int)
tz_tea_caff
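
Listing every non-numeric spelling by hand can miss variants. An alternative sketch, on invented strings, uses `pd.to_numeric(..., errors='coerce')` to turn anything unparseable into `NaN` and then drops it:

```python
import pandas as pd

# Hypothetical raw caffeine entries, including two non-numeric spellings.
caffeine_raw = pd.Series(['75', 'varies', '150', 'Varies', '0'])

# Unparseable strings become NaN, are dropped, and the rest become integers.
caffeine = pd.to_numeric(caffeine_raw, errors='coerce').dropna().astype(int)
print(caffeine.tolist())
```

The trade-off is that coercion silently discards every unparseable entry, so it is worth inspecting what was dropped.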

In [None]:
plt.hist(cls_esp_caff, bins=20)

In [None]:
plt.hist(tz_tea_caff, bins=20)

In [None]:
cls_esp_caff.mean()

In [None]:
tz_tea_caff.mean()

The Null hypothesis here is that the two population means are identical:
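
A minimal sketch of such a two-sided test with `scipy.stats.ttest_ind`, using two small invented samples rather than the Starbucks columns:

```python
import scipy.stats

# Hypothetical caffeine values (mg) for two drink categories.
espresso_caff = [75, 150, 150, 75, 150, 75]
tea_caff = [25, 40, 50, 30, 45, 35]

# By default the test is two-sided and assumes equal population variances.
result = scipy.stats.ttest_ind(espresso_caff, tea_caff)
print(result.statistic, result.pvalue)
```

In this notebook, the corresponding call would presumably be `scipy.stats.ttest_ind(cls_esp_caff, tz_tea_caff)`. Passing `equal_var=False` runs Welch's t-test, which does not assume equal variances.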


From the large t-stat and small p-value, the two means do seem to be significantly different from one another.

Instead of a two-sided hypothesis, let's test some one-sided hypotheses.

Here, the Alternative hypothesis is that the mean of the first population is less than the mean of the second; i.e., the Null hypothesis is that the first mean is greater than or equal to the second:

In [None]:
scipy.stats.ttest_ind(cls_esp_caff, tz_tea_caff, alternative='less')

Here, the Alternative hypothesis is that the mean of the first population is greater than the mean of the second; i.e., the Null hypothesis is that the first mean is less than or equal to the second:

In [None]:
scipy.stats.ttest_ind(cls_esp_caff, tz_tea_caff, alternative='greater')
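
The three variants of the test are closely related: the two one-sided p-values sum to 1, and the two-sided p-value is twice the smaller of them. The sketch below checks this on two invented samples. It assumes a SciPy version that supports the `alternative` argument.

```python
import scipy.stats

# Hypothetical samples, invented for illustration.
first = [75, 150, 150, 75, 150, 75]
second = [25, 40, 50, 30, 45, 35]

two_sided = scipy.stats.ttest_ind(first, second)
less = scipy.stats.ttest_ind(first, second, alternative='less')
greater = scipy.stats.ttest_ind(first, second, alternative='greater')

# The one-sided p-values are complementary.
print(less.pvalue + greater.pvalue)

# The two-sided p-value is twice the smaller one-sided p-value.
print(two_sided.pvalue, 2 * min(less.pvalue, greater.pvalue))
```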

---

# Quizzes, using Starbucks data

## Quiz: Formulate and test your own hypothesis

In [None]:
# TODO: There are many other hypotheses one could test...
sb_df.columns

## Quiz: Visualization

* Suppose we aim to examine the relationship between Trans Fat and Saturated Fat in each category.
* Plot a FacetGrid of the nine values of Beverage Category wherein each subplot shows the relational pattern between the two variables.


In [None]:
# TODO, using the following:
sb_df[['Beverage_category', 'Trans Fat (g) ', 'Saturated Fat (g)']]