# Exercise Nine: Numbers
This week, you'll be exploring the GSS dataset we worked with in the "Social Stats" exercise. Using our demo and the textbook as a guide, pick three new variables to explore. Your workflow should:

- Import the current version of the file (available for download at the link above), and isolate the columns of interest based on the variables you want to include.
- Using the variable navigator provided by GSS, determine the years applicable and narrow your dataset accordingly.
- Visualize at least two quantitative relationships or patterns: these might include connections between clear numerical values, such as age and income,
  or more complex visualizations based on boolean data (for example, our "yes" and "no" to reading fiction).
- Group the data using at least two different divisions to spot interesting trends, and plot at least one variance across a group (refer to our example
  of happiness among fiction readers as a starting point.)

For a bonus challenge, try running another analysis using an advanced method such as summary statistics or cross tabulation.


## Stage One: Imports and Narrows by Column and Year
Import the current version of the file (available for download at the link above), and isolate the columns of interest based on the variables you want to include


In [201]:
import pandas as pd

# we restrict this (very large) dataset to the variables of interest
columns = ['relig', 'year', 'id', 'occ', 'marital', 'agewed', 'divorce','sibs','age']
df = pd.read_stata("GSS7218_R1.dta", columns=columns)

# further limit dataset to the years we are interested in
df = df.loc[df['year'].isin({1972})]
# head is a method, so it has to be called to show the first rows
print(df.head())
print(df.shape)

        relig  year  id    occ        marital agewed divorce sibs   age
0      jewish  1972   1  205.0  NEVER MARRIED    NaN     NaN  3.0  23.0
1    catholic  1972   2  441.0        married   21.0      no  4.0  70.0
2  protestant  1972   3  270.0        married   20.0      no  5.0  48.0
3       other  1972   4    1.0        married   24.0      no  5.0  27.0
4  protestant  1972   5  385.0        married   22.0      no  2.0  61.0
(1613, 9)


In [202]:
# exclude respondents with no recorded marital status
df = df.loc[df['marital'].notna()]

# remove respondents with no recorded age at first marriage ('agewed')
df = df.loc[df['agewed'].notna()]
print(df.head())
print(df.shape)

        relig  year  id    occ   marital agewed divorce sibs   age
1    catholic  1972   2  441.0   married   21.0      no  4.0  70.0
2  protestant  1972   3  270.0   married   20.0      no  5.0  48.0
3       other  1972   4    1.0   married   24.0      no  5.0  27.0
4  protestant  1972   5  385.0   married   22.0      no  2.0  61.0
6    catholic  1972   7  522.0  divorced   22.0     NaN  7.0  28.0
(1395, 9)
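
The cells above keep a single year and drop missing answers one column at a time. If the variable navigator shows that your variables were asked in several years, the same pattern extends naturally. The following is a minimal sketch on an invented stand-in table: the `sample` data and the `years_of_interest` list are assumptions for illustration, not values taken from the GSS file.

```python
import pandas as pd

# Hypothetical stand-in for a few GSS rows; values are invented for illustration
sample = pd.DataFrame({
    "year": [1972, 1972, 1973, 1974, 1974, 1975],
    "marital": ["married", None, "divorced", "married", "widowed", "married"],
    "agewed": [21.0, None, 22.0, None, 18.0, 24.0],
})

# Assumed list of years in which the variables of interest were asked
years_of_interest = [1972, 1974]

# Keep only rows from those years
narrowed = sample.loc[sample["year"].isin(years_of_interest)]

# Drop rows missing either answer in one step
complete = narrowed.dropna(subset=["marital", "agewed"])

print(len(sample), len(narrowed), len(complete))
```

Printing the row count after each step, as here, makes it easy to see how much data each filter removes.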


## Stage Two: Visualize Two Quantitative Aspects of the Data

Using the variable navigator provided by GSS, determine the years applicable and narrow your dataset accordingly.

In [210]:
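import matplotlib.pyplot as plt

# 'agewed' appears to be loaded as a Categorical, which has no mean()
# (TypeError: 'Categorical' does not implement reduction 'mean'),
# so recode the top category and convert the column itself to numbers first
df['agewed'] = pd.to_numeric(df['agewed'].astype(str).replace('24 or older', '24'), errors='coerce')
print(df['agewed'].mean())

# mean age at first marriage for each marital status
df.groupby('marital', observed=True)['agewed'].mean().plot(kind='barh')
plt.xlabel('Mean Age at First Marriage by Marital Status')
plt.legend();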

In [None]:

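# share of "yes" and "no" answers to the 'divorce' question
divorce_counts = df['divorce'].dropna().astype(str).value_counts()
colors = ["#ff9999", "#99ff99"]
# pull the "yes" slice slightly away from the centre
explode = [0.1 if answer == 'yes' else 0 for answer in divorce_counts.index]
fig1, ax1 = plt.subplots()
ax1.pie(divorce_counts, explode=explode, labels=divorce_counts.index, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')
plt.tight_layout()
plt.show()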

## Stage Three: Use Groupby to Spot Additional Trends
Visualize at least two quantitative relationships or patterns: these might include connections between clear numerical values, such as age and income, or more complex visualizations based on boolean data (for example, our "yes" and "no" to reading fiction).
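
One way to chart boolean answers is to recode "yes" and "no" as 1 and 0, so that a group mean reads as the share answering "yes". The sketch below assumes lowercase "yes"/"no" labels and uses an invented `sample` table; the `divorce_flag` column name is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for a few GSS rows; values are invented for illustration
sample = pd.DataFrame({
    "marital": ["married", "married", "divorced", "widowed", "married", "divorced"],
    "divorce": ["no", "yes", "yes", "no", "no", "yes"],
})

# Recode the yes/no answers as 1/0 (assumes lowercase labels)
sample["divorce_flag"] = sample["divorce"].map({"no": 0, "yes": 1})

# The mean of a 0/1 column is the proportion of "yes" answers in each group
share_yes = sample.groupby("marital")["divorce_flag"].mean()
print(share_yes)
```

Appending `.plot(kind='barh')` to `share_yes` would chart these proportions in the same style as the other cells.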

In [None]:

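# NOTE: 'artexbt' and 'sex' are not among the columns loaded in Stage One;
# add them to `columns` and use the variable navigator to pick years in which
# they were asked before running this cell
exhibition_gender = df.groupby('artexbt')['sex'].value_counts()
exhibition_gender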

In [None]:
import matplotlib.pyplot as plt
 
# Data to plot
labels = ['Attended Exhibit','Did Not Attend']
exhibition_counts = df['artexbt'].value_counts()
labels_gender = ['Female','Male','Female','Male']
colors = ['#ff6666', '#ffcc99', '#99ff99', '#66b3ff']
colors_gender = ['#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6']
 
# Plot
plt.pie(exhibition_counts, labels=labels, colors=colors, startangle=90,frame=True)
plt.pie(exhibition_gender, labels=labels_gender, colors=colors_gender,radius=0.75,startangle=90)
centre_circle = plt.Circle((0,0),0.5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
 
plt.axis('equal')
plt.tight_layout()
plt.show()

In [None]:
# NOTE: 'prfmnce', 'artexbt' and 'degree' also need to be added to `columns` in Stage One
df['prfmnce'] = df['prfmnce'].replace(['No', 'Yes'], [0, 1])
df['artexbt'] = df['artexbt'].replace(['No', 'Yes'], [0, 1])
df.head()

df.groupby('degree')['prfmnce'].mean().plot(kind='barh')
plt.xlabel('Performance Attended Mean by Highest Degree')
plt.legend();

In [None]:
df.groupby('degree')['artexbt'].mean().plot(kind='barh')
plt.xlabel('Exhibit Attended Mean by Highest Degree')
plt.legend();

## Stage Four: ?
Group the data using at least two different divisions to spot interesting trends, and plot at least one variance across a group (refer to our example of happiness among fiction readers as a starting point.)
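
As a starting point, grouping by two divisions and measuring the spread within a group might look like the sketch below. It runs on an invented `sample` table rather than the GSS file, and it reads "variance" literally as the statistical variance; a standard deviation (`.std()`) would work the same way.

```python
import pandas as pd

# Hypothetical stand-in for a few GSS rows; values are invented for illustration
sample = pd.DataFrame({
    "relig": ["protestant", "protestant", "catholic", "catholic", "protestant", "catholic"],
    "marital": ["married", "widowed", "married", "married", "married", "widowed"],
    "agewed": [20, 18, 21, 23, 24, 19],
})

# Two divisions at once: mean age at first marriage per religion and marital status
mean_by_group = sample.groupby(["relig", "marital"])["agewed"].mean()
print(mean_by_group)

# Spread within one division: sample variance of age at first marriage per religion
var_by_relig = sample.groupby("relig")["agewed"].var()
print(var_by_relig)
```

Either result can be charted by appending `.plot(kind='barh')`, as in the earlier cells.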

## Bonus Stage:
For a bonus challenge, try running another analysis using an advanced method such as summary statistics or cross tabulation.
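
A minimal sketch of both suggested methods, again on an invented `sample` table rather than the GSS file: `describe()` gives summary statistics for a numeric column, and `pd.crosstab` counts how often each pair of categories occurs together.

```python
import pandas as pd

# Hypothetical stand-in for a few GSS rows; values are invented for illustration
sample = pd.DataFrame({
    "marital": ["married", "married", "divorced", "widowed", "married", "divorced"],
    "divorce": ["no", "yes", "yes", "no", "no", "yes"],
    "agewed": [21, 20, 22, 18, 24, 19],
})

# Summary statistics (count, mean, std, min, quartiles, max) for a numeric column
agewed_summary = sample["agewed"].describe()
print(agewed_summary)

# Cross tabulation: number of respondents for each marital status / divorce answer pair
marital_by_divorce = pd.crosstab(sample["marital"], sample["divorce"])
print(marital_by_divorce)
```

Passing `normalize='index'` to `pd.crosstab` would turn the counts into row proportions, which can be easier to compare across groups of different sizes.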

