# Lab Assignment 10: Exploratory Data Analysis, Part 1
## DS 6001: Practice and Application of Data Science

- Name: Congxin (David) Xu
- Computing ID: cx2rx
- Room Number: 4

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.

In this lab, you will be working with the 2018 [General Social Survey (GSS)](http://www.gss.norc.org/). The GSS is a sociological survey created and regularly collected since 1972 by the National Opinion Research Center at the University of Chicago. It is funded by the National Science Foundation. The GSS collects information and keeps a historical record of the concerns, experiences, attitudes, and practices of residents of the United States, and it is one of the most important data sources for the social sciences. 

The data include features that measure concepts that are notoriously difficult to ask about directly, such as religion, racism, and sexism. The data also include many different metrics of how successful a person is in his or her profession, including income, socioeconomic status, and occupational prestige. These occupational prestige scores are coded separately by the GSS. The full description of their methodology for measuring prestige is available [here](http://gss.norc.org/Documents/reports/methodological-reports/MR122%20Occupational%20Prestige.pdf). Here's a quote to give you an idea about how these scores are calculated:

> Respondents then were given small cards which each had a single occupational titles listed on it. Cards were in English or Spanish. They were given one card at a time in the preordained order. The interviewer then asked the respondent to "please put the card in the box at the top of the ladder if you think that occupation has the highest possible social standing. Put it in the box of the bottom of the ladder if you think it has the lowest possible social standing. If it belongs somewhere in between, just put it in the box that matches the social standing of the occupation."

The prestige scores are calculated from the aggregated rankings according to the method described above.

### Problem 0
Import the following packages:

In [1]:
import numpy as np
import pandas as pd
import sidetable
# this is a module of wquantiles, so type pip install wquantiles 
#   or conda install wquantiles to get access to it
import weighted 
from scipy import stats 
from sklearn import manifold
from sklearn import metrics
import prince
from pandas_profiling import ProfileReport
pd.options.display.max_columns = None

Then load the GSS data with the following code:

In [2]:
%%capture
gss = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gss2018.csv",
                 encoding='cp1252', na_values=['IAP','IAP,DK,NA,uncodeable', 'NOT SURE',
                                               'DK', 'IAP, DK, NA, uncodeable', '.a', "CAN'T CHOOSE"])

### Problem 1
Drop all columns except for the following:
* `id` - a numeric unique ID for each person who responded to the survey
* `wtss` - survey sample weights
* `sex` - male or female
* `educ` - years of formal education
* `region` - region of the country where the respondent lives
* `age` - age
* `coninc` - the respondent's personal annual income
* `prestg10` - the respondent's occupational prestige score, as measured by the GSS using the methodology described above
* `mapres10` - the respondent's mother's occupational prestige score, as measured by the GSS using the methodology described above
* `papres10` - the respondent's father's occupational prestige score, as measured by the GSS using the methodology described above
* `sei10` - an index measuring the respondent's socioeconomic status
* `satjob` - responses to "On the whole, how satisfied are you with the work you do?"
* `fechld` - agree or disagree with: "A working mother can establish just as warm and secure a relationship with her children as a mother who does not work."
* `fefam` - agree or disagree with: "It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family."
* `fepol` - agree or disagree with: "Most men are better suited emotionally for politics than are most women."
* `fepresch` - agree or disagree with: "A preschool child is likely to suffer if his or her mother works."
* `meovrwrk` - agree or disagree with: "Family life often suffers because men concentrate too much on their work."

Then rename any columns with names that are non-intuitive to you to more intuitive and descriptive ones. Finally, replace the "89 or older" values of `age` with 89, and convert `age` to a float data type. [1 point]

In [3]:
gss = gss[['id', 'wtss', 'sex','educ', 'region', 'age', 'coninc', 'prestg10',
    'mapres10', 'papres10', 'sei10', 'satjob', 'fechld', 'fefam', 
     'fepol', 'fepresch', 'meovrwrk']]
gss = gss.rename({"coninc": "annual_income",
                  'sei10':'socioeconomic_score',
                  'mapres10':'mom_pres_score',
                  'papres10':'dad_pres_score',
                  'fechld':'female_child',
                  'fefam':'female_family',
                  'fepol':'female_politics',
                  'fepresch':'female_preschool',
                  'meovrwrk':'men_work'
                 }, axis = 1)
gss.age = gss.age.replace('89 or older', 89).astype(float)
gss.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2348 entries, 0 to 2347
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   2348 non-null   int64  
 1   wtss                 2348 non-null   float64
 2   sex                  2348 non-null   object 
 3   educ                 2345 non-null   float64
 4   region               2348 non-null   object 
 5   age                  2341 non-null   float64
 6   annual_income        2152 non-null   float64
 7   prestg10             2248 non-null   float64
 8   mom_pres_score       1657 non-null   float64
 9   dad_pres_score       1842 non-null   float64
 10  socioeconomic_score  2248 non-null   float64
 11  satjob               1739 non-null   object 
 12  female_child         1550 non-null   object 
 13  female_family        1545 non-null   object 
 14  female_politics      1499 non-null   object 
 15  female_preschool     1536 non-null   o

### Problem 2
#### Part a
Use the `ProfileReport()` function to generate and embed an HTML formatted exploratory data analysis report in your notebook. Make sure that it includes a "Correlations" report along with "Overview" and "Variables". [1 point]

In [4]:
profile = ProfileReport(gss, 
                        title='Pandas Profiling Report',
                        html={'style':{'full_width':True}},
                        minimal = False)
profile.to_notebook_iframe()

#### Part b
Looking through the HTML report you displayed in part a, how many people in the data are from New England? [1 point]

**Answer:** There are 124 people in New England.

#### Part c
Looking through the HTML report you displayed in part a, which feature in the data has the highest number of missing values, and what percent of the values are missing for this feature? [1 point]

**Answer:** `female_politics` has the most missing values: 36.2% of its values are missing.

#### Part d
Looking through the HTML report you displayed in part a, which two distinct features in the data have the highest correlation? [1 point]

**Answer:** `socioeconomic_score` and `prestg10` have the highest correlation.

### Problem 3
On a primetime show on a 24-hour cable news network, two unpleasant-looking men in suits sit across a table from each other, scowling. One says "This economy is failing the middle class. The average American today is making less than \$48,000 a year." The other screams "Fake news! The typical American makes more than \$55,000 a year!" Explain, using words and code, how the data can support both of their arguments. Use the sample weights to calculate descriptive statistics that are more representative of the American adult population as a whole. [1 point]

In [5]:
# Weighted Mean
gss_p3 = gss[gss['annual_income'].notna()]
np.average(gss_p3.annual_income, weights=gss_p3.wtss)

55158.96280421564

In [6]:
# Weighted Median
weighted.median(gss_p3.annual_income,gss_p3.wtss)

47317.5

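Both claims can be supported because they rest on different measures of central tendency. The weighted mean income is about \$55,159, which backs the claim that the "typical" American makes more than \$55,000. The weighted median income is \$47,317.50, which backs the claim that the "average" American makes less than \$48,000. The labels are reversed relative to common usage: "average" usually refers to the mean and "typical" to the median. The mean sits above the median, which is the pattern a right-skewed income distribution produces.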
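
To make the gap between the two statistics concrete, here is a minimal sketch on made-up numbers (the incomes, weights, and the helper `weighted_median` are hypothetical, not part of the lab). It uses a simple step-function definition of the weighted median; `weighted.median` may apply a different interpolation rule, so its results can differ slightly.

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    return values[np.searchsorted(cum, 0.5 * cum[-1])]

# Hypothetical incomes and sample weights (not GSS values)
income = np.array([20_000, 30_000, 40_000, 50_000, 200_000])
wt = np.array([2.0, 1.0, 1.0, 1.0, 1.0])

w_mean = np.average(income, weights=wt)   # pulled upward by the 200,000 income
w_median = weighted_median(income, wt)    # unaffected by how large the top income is
print(w_mean, w_median)
```

A single very large income raises the weighted mean well above the weighted median, which is the same pattern the two cells above show for the GSS incomes.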

### Problem 4
For each of the following parts, 
* generate a table that provides evidence about the relationship between the two features in the data that are relevant to each question, 
* interpret the table in words, 
* use a hypothesis test to assess the strength of the evidence in the table, 
* and provide a **specific and accurate** interpretation of the $p$-value associated with this hypothesis test beyond "significant or not". 

#### Part a
Is there a gender wage gap? That is, is there a difference between the average incomes of men and women? [2 points]

In [7]:
gss[['sex', 'annual_income']].groupby('sex').mean()

| sex | annual_income |
|---|---|
| female | 47191.021452 |
| male | 53314.626187 |


In [8]:
annual_income_men = gss.query("sex=='male'").annual_income
annual_income_women = gss.query("sex=='female'").annual_income
stats.ttest_ind(annual_income_men, annual_income_women, equal_var=False)

Ttest_indResult(statistic=nan, pvalue=nan)

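The test returns `nan` for both the statistic and the p-value because `annual_income` contains missing values, and `stats.ttest_ind` propagates them by default. This output therefore gives no basis for rejecting or retaining the null hypothesis; the missing incomes need to be dropped (for example with `.dropna()`) before the test can be interpreted. The table shows that men's mean income exceeds women's by about \$6,124 in the sample. Once the test is rerun, its p-value is the probability of drawing a sample with a difference in mean incomes at least this large, in either direction, under the assumption that men and women have equal mean incomes in the population.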
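
The effect of missing values on the test can be sketched on a small made-up example (the arrays below are hypothetical, not GSS data). It contrasts the default behaviour of `stats.ttest_ind` with the result after dropping `NaN` values.

```python
import numpy as np
from scipy import stats

# Hypothetical incomes with missing values (not GSS data)
men = np.array([52_000, 61_000, np.nan, 48_000, 75_000, 58_000])
women = np.array([45_000, np.nan, 50_000, 41_000, 56_000, 47_000])

# Default behaviour: NaN propagates into the statistic and the p-value
raw = stats.ttest_ind(men, women, equal_var=False)

# Dropping NaN first gives a usable Welch t-test
clean = stats.ttest_ind(men[~np.isnan(men)], women[~np.isnan(women)],
                        equal_var=False)

print(raw.pvalue, clean.statistic, clean.pvalue)
```

Passing `nan_policy='omit'` to `stats.ttest_ind` is an alternative to filtering the arrays by hand.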

#### Part b
Are there different average values of occupational prestige for different levels of job satisfaction? [2 points]

In [9]:
gss[['prestg10', 'satjob']].groupby('satjob').mean()

| satjob | prestg10 |
|---|---|
| a little dissat | 40.946429 |
| mod. satisfied | 42.589984 |
| very dissatisfied | 43.0 |
| very satisfied | 46.18932 |


In [10]:
stats.f_oneway(gss.query("satjob=='a little dissat'").prestg10.dropna(),
               gss.query("satjob=='mod. satisfied'").prestg10.dropna(),
               gss.query("satjob=='very dissatisfied'").prestg10.dropna(),
               gss.query("satjob=='very satisfied'").prestg10.dropna())

F_onewayResult(statistic=12.205403153509732, pvalue=6.676686425029878e-08)

The table shows that mean occupational prestige ranges from about 40.9 for respondents who are "a little dissat" to about 46.2 for those who are "very satisfied". The p-value is about 6.7e-08 (roughly .00000007), which is the probability, under the assumption that mean occupational prestige is the same at every level of job satisfaction, of drawing a sample with an F-statistic of 12.21 or higher. Because this probability is far lower than .05, we reject the null hypothesis and conclude that mean occupational prestige differs between at least two levels of job satisfaction.
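
The four `query` calls can also be written as a single `groupby`, which scales to any number of categories. The sketch below uses a small made-up frame (the values and the three category labels are hypothetical, not GSS data).

```python
import pandas as pd
from scipy import stats

# Hypothetical prestige scores by satisfaction level (not GSS data)
df = pd.DataFrame({
    "satjob": ["low", "low", "low", "mid", "mid", "mid",
               "high", "high", "high", None],
    "prestg10": [38.0, 41.0, None, 42.0, 44.0, 43.0,
                 47.0, 45.0, 49.0, 50.0],
})

# One array per category; groupby skips rows whose key is missing,
# and dropna() removes missing prestige scores within each group
groups = [g.dropna().to_numpy() for _, g in df.groupby("satjob")["prestg10"]]

result = stats.f_oneway(*groups)
print(len(groups), result.statistic, result.pvalue)
```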


### Problem 5
Report the Pearson correlations between years of education, socioeconomic status, income, occupational prestige, and a person's mother's and father's occupational prestige. Then perform a hypothesis test for the correlation between years of education and socioeconomic status and provide a **specific and accurate** interpretation of the $p$-value associated with this hypothesis test beyond "significant or not". [2 points]

In [11]:
gss[['educ', 'socioeconomic_score', 'annual_income', 'prestg10', 'mom_pres_score', 'dad_pres_score']].corr()

| | educ | socioeconomic_score | annual_income | prestg10 | mom_pres_score | dad_pres_score |
|---|---|---|---|---|---|---|
| educ | 1.0 | 0.558169 | 0.389245 | 0.479933 | 0.269115 | 0.261417 |
| socioeconomic_score | 0.558169 | 1.0 | 0.41721 | 0.835515 | 0.203486 | 0.210451 |
| annual_income | 0.389245 | 0.41721 | 1.0 | 0.340995 | 0.164881 | 0.171048 |
| prestg10 | 0.479933 | 0.835515 | 0.340995 | 1.0 | 0.189262 | 0.19218 |
| mom_pres_score | 0.269115 | 0.203486 | 0.164881 | 0.189262 | 1.0 | 0.23575 |
| dad_pres_score | 0.261417 | 0.210451 | 0.171048 | 0.19218 | 0.23575 | 1.0 |


In [12]:
gss_corr = gss[['educ', 'socioeconomic_score']].dropna()
stats.pearsonr(gss_corr['educ'], gss_corr['socioeconomic_score'])

(0.5581686004626782, 3.719448810018995e-184)

The first number is the correlation coefficient, which is about .56. The positive sign means that people with more years of education tend to have higher socioeconomic status scores. The p-value is the second number, about 3.7e-184, which is effectively 0. The p-value is the probability that a random sample could produce a correlation as extreme as .56 in either direction, assuming that the correlation is 0 in the population. Because the p-value is so small, we reject the null hypothesis that these two features are uncorrelated and conclude that there is a nonzero correlation between years of education and socioeconomic status.
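
For readers who want to see where this p-value comes from, the sketch below recomputes it on simulated data (the arrays `x` and `y` are hypothetical, not GSS features). Under the null hypothesis of zero correlation, $t = r\sqrt{(n-2)/(1-r^2)}$ follows a $t$ distribution with $n-2$ degrees of freedom, and the two-sided tail probability should agree with the value `stats.pearsonr` reports.

```python
import numpy as np
from scipy import stats

# Simulated data with a modest positive relationship (not GSS data)
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

r, p_scipy = stats.pearsonr(x, y)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(r, p_scipy, p_manual)
```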

### Problem 6
Create a new categorical feature for age groups, with categories for 18-35, 36-49, 50-69, and 70 and older (see the module 8 notebook for an example of how to do this). 

Then create a cross-tabulation in which the rows represent age groups and the columns represent responses to the statement that "It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family." Rearrange the columns so that they are in the following order: strongly agree, agree, disagree, strongly disagree. Place row percents in the cells of this table.

Finally, use a hypothesis test that can tell us whether there is enough evidence to conclude that these two features have a relationship, and provide a specific and accurate interpretation of the $p$-value. [2 points]

In [16]:
gss['age_group'] = pd.cut(gss.age, 
                          bins=[18,36,50,70,200], 
                          labels=("18-35", "36-49", "50-69", "70 and older"))
gss['age_group'].value_counts()

50-69           779
18-35           690
36-49           538
70 and older    312
Name: age_group, dtype: int64
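
One detail of `pd.cut` is worth checking here. With the default `right=True`, the bins `[18,36,50,70,200]` are the intervals (18, 36], (36, 50], (50, 70], and (70, 200], so an age of exactly 18 falls outside every bin and an age of 36 lands in "18-35". The counts above sum to 2,319 although `age` has 2,341 non-null values, which is consistent with 18-year-olds being left out. The sketch below, on a handful of made-up ages, compares the default with `right=False`.

```python
import pandas as pd

# Hypothetical ages chosen to sit on the bin edges
ages = pd.Series([18, 35, 36, 49, 50, 69, 70, 89], dtype=float)
labels = ["18-35", "36-49", "50-69", "70 and older"]

# Default right=True: (18, 36], (36, 50], (50, 70], (70, 200]
right_closed = pd.cut(ages, bins=[18, 36, 50, 70, 200], labels=labels)

# right=False: [18, 36), [36, 50), [50, 70), [70, 200)
left_closed = pd.cut(ages, bins=[18, 36, 50, 70, 200], labels=labels,
                     right=False)

print(pd.DataFrame({"age": ages,
                    "right_closed": right_closed,
                    "left_closed": left_closed}))
```

With `right=False`, each label matches the ages it names.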

In [17]:
gss['men_work'] = gss['men_work'].astype('category')
gss['men_work'] = gss['men_work'].cat.reorder_categories(['strongly agree',
                                                          'agree',
                                                          'neither agree nor disagree', 
                                                          'disagree',
                                                          'strongly disagree'])
pd.crosstab(gss.age_group, gss.men_work)

| age_group \ men_work | strongly agree | agree | neither agree nor disagree | disagree | strongly disagree |
|---|---|---|---|---|---|
| 18-35 | 35 | 130 | 70 | 155 | 26 |
| 36-49 | 33 | 144 | 54 | 107 | 20 |
| 50-69 | 49 | 243 | 64 | 146 | 22 |
| 70 and older | 16 | 122 | 31 | 54 | 5 |
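
The table above holds raw counts for `men_work`, whereas the problem asks for row percents and a hypothesis test, and the statement it quotes corresponds to `fefam` (renamed `female_family` in Problem 1). As a sketch of the remaining mechanics only, the code below re-enters the counts shown above by hand, converts them to row percents, and runs a chi-square test of independence. On the real data, `pd.crosstab(gss.age_group, gss.female_family, normalize='index')` would be one way to get row proportions directly.

```python
import pandas as pd
from scipy import stats

cols = ["strongly agree", "agree", "neither agree nor disagree",
        "disagree", "strongly disagree"]
# Counts copied from the men_work cross-tabulation above
counts = pd.DataFrame(
    [[35, 130, 70, 155, 26],
     [33, 144, 54, 107, 20],
     [49, 243, 64, 146, 22],
     [16, 122, 31, 54, 5]],
    index=["18-35", "36-49", "50-69", "70 and older"],
    columns=cols,
)

# Row percents: each row sums to 100
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100

# Chi-square test of independence on the raw counts (not the percents)
chi2, p, dof, expected = stats.chi2_contingency(counts.values)

print(row_pct.round(1))
print(chi2, p, dof)
```

The p-value from this test is the probability of seeing a chi-square statistic at least this large if age group and response were independent in the population.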


### Problem 7
For this problem, you will conduct and interpret a correspondence analysis on the categorical features that ask respondents to state the extent to which they agree or disagree with the statements:
* "A working mother can establish just as warm and secure a relationship with her children as a mother who does not work."
* "It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family."
* "Most men are better suited emotionally for politics than are most women."
* "A preschool child is likely to suffer if his or her mother works."
* "Family life often suffers because men concentrate too much on their work."

#### Part a
Conduct a correspondence analysis using the observed features listed above that measures two latent features. Plot the two latent categories for each category in each of the features used in the analysis. [2 points]

In [20]:
gss_cat = gss[['female_child','female_family','female_politics','female_preschool','men_work']].dropna()
gss_cat

| | female_child | female_family | female_politics | female_preschool | men_work |
|---|---|---|---|---|---|
| 0 | strongly agree | disagree | agree | strongly disagree | agree |
| 2 | strongly agree | disagree | disagree | disagree | disagree |
| 3 | agree | disagree | disagree | disagree | neither agree nor disagree |
| 5 | strongly agree | disagree | disagree | disagree | agree |
| 8 | disagree | strongly disagree | disagree | agree | agree |
| ... | ... | ... | ... | ... | ... |
| 2341 | disagree | strongly agree | agree | disagree | agree |
| 2343 | disagree | strongly disagree | disagree | strongly disagree | disagree |
| 2344 | strongly agree | disagree | disagree | disagree | disagree |
| 2346 | disagree | agree | disagree | strongly agree | agree |
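
The cell above only prepares the data; the correspondence analysis itself is not shown. As a rough illustration of what a multiple correspondence analysis computes, the sketch below runs the calculation directly with NumPy on a tiny made-up frame, rather than through `prince` (the column names `q1` to `q3` and the responses are hypothetical). `prince.MCA` may scale or sign its coordinates differently, so treat this as a picture of the idea, not a reproduction of that library's output.

```python
import numpy as np
import pandas as pd

# Hypothetical agree/disagree responses (not GSS data)
df = pd.DataFrame({
    "q1": ["agree", "agree", "disagree", "disagree", "agree", "disagree"],
    "q2": ["agree", "disagree", "disagree", "disagree", "agree", "agree"],
    "q3": ["disagree", "disagree", "agree", "agree", "disagree", "agree"],
})

Z = pd.get_dummies(df).astype(float)   # indicator matrix, one column per category
P = Z.to_numpy() / Z.to_numpy().sum()  # correspondence matrix
r = P.sum(axis=1)                      # row masses
c = P.sum(axis=0)                      # column masses

# Standardized residuals from independence, then a singular value decomposition
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

k = 2  # number of latent features to keep
col_coords = pd.DataFrame((Vt.T[:, :k] * sigma[:k]) / np.sqrt(c)[:, None],
                          index=Z.columns, columns=["latent_1", "latent_2"])
row_coords = pd.DataFrame((U[:, :k] * sigma[:k]) / np.sqrt(r)[:, None],
                          index=df.index, columns=["latent_1", "latent_2"])

print(col_coords.sort_values("latent_1"))
```

Sorting the category coordinates by the first latent feature, as in the last line, is the view Part b asks for.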


#### Part b
Display the latent features for every category in the observed features, sorted by the first latent feature. Describe in words what concept this feature is attempting to measure, and give the feature a name. [2 points]

#### Part c
We can use the results of the MCA model to conduct some cool EDA. For one example, follow these steps:

1. Use the `.row_coordinates()` method to calculate values of the latent feature for every row in the data you passed to the MCA in part a. Extract the first column and store it in its own dataframe.

2. To join it with the full, cleaned GSS data based on row numbers (instead of on a primary key), use the `.join()` method. For example, if we named the cleaned GSS data `gss_clean` and if we named the dataframe in step 1 `latentfeature`, we can type
```
gss_clean = gss_clean.join(latentfeature, how="outer")
```
3. Create a cross-tabulation with age categories (that you constructed in problem 6) in the rows and sex in the columns. Instead of a frequency, place the mean value of the latent feature in the cells. 

What does this table tell you about the relationship between sex, age, and the latent feature? [2 points]
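
The three steps above can be sketched on a small made-up example. The frames below are hypothetical stand-ins for `gss_clean` and `latentfeature`, and the column name `latent_1` is an assumption. Row 5 is missing from `latentfeature`, as it would be for a respondent dropped by `.dropna()` before the MCA.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned GSS data
gss_clean = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "female", "male"],
    "age_group": ["18-35", "18-35", "36-49", "36-49", "18-35", "36-49"],
})

# Hypothetical latent feature; index 5 is absent
latentfeature = pd.DataFrame({"latent_1": [0.5, 1.0, -0.2, 0.4, 2.0]},
                             index=[0, 1, 2, 3, 4])

# Join on row index; rows without a latent value get NaN
joined = gss_clean.join(latentfeature, how="outer")

# Mean of the latent feature in each age group by sex cell (NaN is ignored)
table = pd.crosstab(joined["age_group"], joined["sex"],
                    values=joined["latent_1"], aggfunc="mean")
print(table)
```

Because the join is on the row index, it only lines up correctly if `latentfeature` keeps the index of the rows that were passed to the MCA.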