## Data Analysis Template

***

## Project Description

The assignments for this course start where the Data Management and Visualization course assignments left off. Now that you have selected a data set and research question, managed your variables of interest and visualized their relationship graphically, we are ready to test those relationships statistically. We have included the codebooks and data sets from Data Management and Visualization for your convenience.

The first assignment deals with analysis of variance. Analysis of variance assesses whether the means of two or more groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means (quantitative variables) of groups (categorical variables). The null hypothesis is that there is no difference in the mean of the quantitative variable across groups (categorical variable), while the alternative is that there is a difference.

Note that if your research question does not include one quantitative variable, you can use one from your data set just to get some practice with the tool. If your research question does not include a categorical variable, you can categorize one that is quantitative.
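As a minimal sketch of the last point, a quantitative variable can be binned into categories with `pandas.qcut`. The data here are synthetic and the column name `urbancat` is hypothetical; this is not necessarily how the category columns in the course data set were built.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a quantitative column (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"urbanrate": rng.uniform(10, 100, size=200)})

# Split the values into four equal-frequency groups (quartiles), coded 0..3.
# "urbancat" is a hypothetical column name used only for this illustration.
demo["urbancat"] = pd.qcut(demo["urbanrate"], q=4, labels=False)

# Each category should hold roughly a quarter of the rows.
print(demo["urbancat"].value_counts().sort_index())
```

`pd.cut` could be used instead when fixed, equal-width bin edges are preferred over equal-frequency groups.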

## Data Dictionary

| Field                | Description                                                   |
|----------------------|---------------------------------------------------------------|
| incomeperperson      | 2010 Gross Domestic Product per capita in constant 2000 US$   |
| alcconsumption       | 2008 alcohol consumption per adult (age 15+), litres          |
| armedforcesrate      | Armed forces personnel (% of total labor force)               |
| breastcancerper100TH | 2002 breast cancer new cases per 100,000 females              |
| co2emissions         | 2006 cumulative CO2 emission (metric tons)                    |
| femaleemployrate     | 2007 female employees age 15+ (% of population)               |
| employrate           | 2007 total employees age 15+ (% of population)                |
| HIVrate              | 2009 estimated HIV prevalence (%)                             |
| Internetuserate      | 2010 Internet users (per 100 people)                          |
| lifeexpectancy       | 2011 life expectancy at birth (years)                         |
| oilperperson         | 2010 oil consumption per capita (tonnes per year and person)  |
| polityscore          | 2009 Democracy score (Polity)                                 |
| relectricperperson   | 2008 residential electricity consumption, per person (kWh)    |
| suicideper100TH      | 2005 suicide, age adjusted, per 100,000                       |
| urbanrate            | 2008 urban population (% of total)                            |

## Summary

Run an analysis of variance.

You will need to analyze and interpret post hoc paired comparisons in instances where your original statistical test was significant, and you were examining more than two groups (i.e. more than two levels of a categorical, explanatory variable). 

**The p-value is 0.109 > 0.05, so we fail to reject the null hypothesis at the 5% significance level.**
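Had the test been significant with more than two groups, the post hoc paired comparisons mentioned above could be run with Tukey's HSD. The following is only a sketch on synthetic data; the group labels and values are invented for illustration, and Tukey's HSD is one common choice among several post hoc procedures.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Three synthetic groups; group "c" is given a clearly higher mean.
toy = pd.DataFrame({
    "value": np.concatenate([
        rng.normal(10, 2, 40),
        rng.normal(10, 2, 40),
        rng.normal(15, 2, 40),
    ]),
    "group": np.repeat(["a", "b", "c"], 40),
})

# Compare every pair of groups while controlling the family-wise error rate.
tukey = pairwise_tukeyhsd(endog=toy["value"], groups=toy["group"], alpha=0.05)
print(tukey.summary())
```

The `reject` column of the summary flags which pairs of group means differ significantly.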

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Import the class and timedelta directly; a bare `import datetime` would be shadowed by this line
from datetime import datetime, timedelta
import scipy.stats
import pandas_profiling
from pandas_profiling import ProfileReport


%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
# sns.set() resets the style to its default, so pass the style and font scale together
sns.set(style='dark', font_scale=1.2)

plt.rc('axes', titlesize=9)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)

import warnings
warnings.filterwarnings('ignore')

# Use Folium library to plot values on a map.
#import folium

# Use Feature-Engine library
#import feature_engine
#import feature_engine.missing_data_imputers as mdi
#from feature_engine.outlier_removers import Winsorizer
#from feature_engine import categorical_encoders as ce
#from feature_engine.discretisation import EqualWidthDiscretiser, EqualFrequencyDiscretiser, DecisionTreeDiscretiser
#from feature_engine.encoding import OrdinalEncoder

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)
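# NOTE: pd.option_context only takes effect inside a `with` block, so this call is a no-op here.
# To apply the format globally, use pd.set_option('display.float_format', '{:.2f}'.format).
pd.option_context('float_format','{:.2f}'.format)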

np.random.seed(0)
np.set_printoptions(suppress=True)



In [2]:
df = pd.read_csv("gapminderfinal5.csv")

In [3]:
df

|  | incomeperperson | alcconsumption | armedforcesrate | breastcancerper100th | co2emissions | femaleemployrate | hivrate | internetuserate | lifeexpectancy | oilperperson | polityscore | relectricperperson | suicideper100th | employrate | urbanrate | demoscorecat | co2cat | incomecat | alccat | electricat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8740.97 | 0.03 | 0.57 | 27 | 76.0 | 25.60 | 1.94 | 4 | 49 | 1.48 | 0 | 1173.18 | 7 | 55.70 | 24.04 | 1 | 1 | 3 | 0 | 0.155844 |
| 1 | 1915.00 | 7.29 | 1.02 | 57 | 224.0 | 42.10 | 1.94 | 45 | 77 | 1.48 | 9 | 636.34 | 8 | 51.40 | 46.72 | 3 | 2 | 1 | 3 | 0.101449 |
| 2 | 2231.99 | 0.69 | 2.31 | 24 | 2932.0 | 31.70 | 0.10 | 12 | 73 | 0.42 | 2 | 590.51 | 5 | 50.50 | 65.22 | 2 | 3 | 2 | 0 | 0.101449 |
| 3 | 21943.34 | 10.17 | 1.44 | 37 | 5033.0 | 47.55 | 1.94 | 81 | 70 | 1.48 | 4 | 1173.18 | 5 | 58.64 | 88.92 | 2 | 4 | 4 | 4 | 0.155844 |
| 4 | 1381.00 | 5.57 | 1.46 | 23 | 248.0 | 69.40 | 2.00 | 10 | 51 | 1.48 | -2 | 173.00 | 15 | 75.70 | 56.70 | 1 | 2 | 1 | 2 | 0.101449 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 208 | 722.81 | 3.91 | 1.09 | 16 | 1425.0 | 67.60 | 0.40 | 28 | 75 | 1.48 | -7 | 302.73 | 12 | 71.00 | 27.84 | 0 | 3 | 1 | 1 | 0.101449 |
| 209 | 8740.97 | 6.69 | 5.94 | 37 | 14.0 | 11.30 | 1.94 | 36 | 73 | 1.48 | 4 | 1173.18 | 10 | 32.00 | 71.90 | 2 | 0 | 3 | 2 | 0.155844 |
| 210 | 610.36 | 0.20 | 2.32 | 35 | 235.0 | 20.30 | 1.94 | 12 | 65 | 1.48 | -2 | 130.06 | 6 | 39.00 | 30.64 | 1 | 2 | 0 | 0 | 0.101449 |
| 211 | 432.23 | 3.56 | 0.34 | 13 | 132.0 | 53.50 | 13.50 | 10 | 49 | 1.48 | 7 | 168.62 | 12 | 61.00 | 35.42 | 3 | 2 | 0 | 1 | 0.101449 |


## Exploratory Data Analysis

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213 entries, 0 to 212
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   incomeperperson       213 non-null    float64
 1   alcconsumption        213 non-null    float64
 2   armedforcesrate       213 non-null    float64
 3   breastcancerper100th  213 non-null    int64  
 4   co2emissions          213 non-null    float64
 5   femaleemployrate      213 non-null    float64
 6   hivrate               213 non-null    float64
 7   internetuserate       213 non-null    int64  
 8   lifeexpectancy        213 non-null    int64  
 9   oilperperson          213 non-null    float64
 10  polityscore           213 non-null    int64  
 11  relectricperperson    213 non-null    float64
 12  suicideper100th       213 non-null    int64  
 13  employrate            213 non-null    float64
 14  urbanrate             213 non-null    float64
 15  demoscorecat          2

In [5]:
df.describe()

|  | incomeperperson | alcconsumption | armedforcesrate | breastcancerper100th | co2emissions | femaleemployrate | hivrate | internetuserate | lifeexpectancy | oilperperson | polityscore | relectricperperson | suicideper100th | employrate | urbanrate | demoscorecat | co2cat | incomecat | alccat | electricat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 | 213.0 |
| mean | 8740.966338 | 6.689484 | 1.443052 | 37.323944 | 5033.244131 | 47.549531 | 1.936854 | 35.685446 | 69.751174 | 1.481362 | 3.765258 | 1173.17939 | 9.685446 | 58.63662 | 56.76939 | 2.061033 | 2.0 | 2.0 | 1.99061 | 0.126761 |
| std | 13466.912542 | 4.589345 | 1.498692 | 20.443277 | 24936.503422 | 13.364005 | 3.632102 | 26.418255 | 9.241981 | 0.987116 | 5.487663 | 1341.777091 | 5.955782 | 9.61196 | 23.275759 | 1.009872 | 1.420869 | 1.420869 | 1.417514 | 0.076255 |
| min | 103.78 | 0.03 | 0.0 | 4.0 | 0.0 | 11.3 | 0.06 | 0.0 | 48.0 | 0.03 | -10.0 | 0.0 | 0.0 | 32.0 | 10.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 952.83 | 3.23 | 0.57 | 23.0 | 38.0 | 40.3 | 0.2 | 12.0 | 65.0 | 1.48 | 1.0 | 431.63 | 6.0 | 53.5 | 37.34 | 2.0 | 1.0 | 1.0 | 1.0 | 0.101449 |
| 50% | 3665.35 | 6.69 | 1.21 | 35.0 | 235.0 | 47.55 | 1.2 | 36.0 | 72.0 | 1.48 | 4.0 | 1173.18 | 10.0 | 58.64 | 56.77 | 2.0 | 2.0 | 2.0 | 2.0 | 0.101449 |
| 75% | 8740.97 | 9.5 | 1.44 | 44.0 | 2422.0 | 53.6 | 1.94 | 52.0 | 76.0 | 1.48 | 8.0 | 1173.18 | 12.0 | 63.7 | 73.5 | 3.0 | 3.0 | 3.0 | 3.0 | 0.155844 |
| max | 105147.44 | 23.01 | 10.64 | 101.0 | 334221.0 | 83.3 | 25.9 | 96.0 | 83.0 | 12.23 | 10.0 | 11154.76 | 36.0 | 83.2 | 100.0 | 3.0 | 4.0 | 4.0 | 4.0 | 0.291667 |


In [6]:
df.columns

Index(['incomeperperson', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'internetuserate', 'lifeexpectancy', 'oilperperson', 'polityscore', 'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate', 'demoscorecat', 'co2cat', 'incomecat', 'alccat', 'electricat'], dtype='object')

## Hypothesis Testing

The goal of hypothesis testing is to answer the question, “Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?” The first step is to quantify the size of the apparent effect by choosing a test statistic (for example the t statistic or the ANOVA F statistic). The next step is to define a null hypothesis, which is a model of the system based on the assumption that the apparent effect is not real. Then compute the p-value, which is the probability of observing an effect at least as large as the one in the sample if the null hypothesis is true. Finally, interpret the result: if the p-value is low (conventionally below 0.05), the effect is said to be statistically significant, which means the data are unlikely under the null hypothesis.
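One way to make this definition concrete is a permutation test, sketched below on synthetic data. The group sizes, means and number of permutations are arbitrary assumptions chosen for illustration; the tests later in this notebook use closed-form distributions instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic groups; the apparent effect is the difference in their means.
group_a = rng.normal(0.0, 1.0, 50)
group_b = rng.normal(0.5, 1.0, 50)

# Test statistic: absolute difference in group means.
observed = abs(group_a.mean() - group_b.mean())

# Null hypothesis: group labels are irrelevant, so shuffle them and
# recompute the statistic many times.
pooled = np.concatenate([group_a, group_b])
n_perm = 5000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    diff = abs(shuffled[:50].mean() - shuffled[50:].mean())
    if diff >= observed:
        count += 1

# p-value: share of shuffles with an effect at least as large as observed.
p_value = count / n_perm
print(f"observed difference: {observed:.3f}, p-value: {p_value:.4f}")
```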

### T-Test

We will be using the t-test for independent samples. For the independent t-test, the following assumptions must be met.

-   One independent, categorical variable with two levels (groups)
-   One dependent continuous variable
-   Independence of the observations. Each subject should belong to only one group. There is no relationship between the observations in each group.
-   The dependent variable must follow a normal distribution
-   Assumption of homogeneity of variance


State the hypothesis

-   $H_0: \mu_1 = \mu_2$ ("there is no difference in evaluation scores between males and females")
-   $H_1: \mu_1 \neq \mu_2$ ("there is a difference in evaluation scores between males and females")


### Levene's Test

In [7]:
# scipy.stats.levene(ratings_df[ratings_df['gender'] == 'female']['eval'],
#                    ratings_df[ratings_df['gender'] == 'male']['eval'], center='mean')
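A runnable variant of the commented call above, using synthetic data in place of `ratings_df` (the group values are invented and deliberately given different spreads), might look like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two synthetic groups with the same mean but very different variances.
group_1 = rng.normal(loc=50, scale=5, size=100)
group_2 = rng.normal(loc=50, scale=15, size=100)

# Levene's test: H0 is that the groups have equal variances.
stat, p = stats.levene(group_1, group_2, center="mean")
print(f"Levene statistic: {stat:.3f}, p-value: {p:.4g}")

# A small p-value suggests unequal variances, which matters for the
# equal_var argument of the two-sample t-test.
equal_var = bool(p >= 0.05)
print("Assume equal variances:", equal_var)
```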

## T-Test

### One Sample T-Test

In [8]:
# t, p = scipy.stats.ttest_1samp(a=df.dose, popmean=1.166667)

In [9]:
# print("T-test value is: ", t)
# print("p-value value is: ", p)
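As a sketch of what the commented cells above would do, the following runs a one-sample t-test on a synthetic sample and checks the statistic against its textbook formula. The sample and the hypothesised mean of 1.0 are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic sample whose true mean (1.5) differs from the hypothesised 1.0.
sample = rng.normal(loc=1.5, scale=0.5, size=30)
popmean = 1.0

t, p = stats.ttest_1samp(a=sample, popmean=popmean)

# Same statistic by hand: (sample mean - popmean) / standard error.
t_manual = (sample.mean() - popmean) / (sample.std(ddof=1) / np.sqrt(len(sample)))

print("T-test value is:", t)
print("p-value is:", p)
print("Manual t:", t_manual)
```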

### Two Samples T-Test

In [10]:
#t, p = scipy.stats.ttest_ind(a=df.len, b=df.dose, equal_var=True)  # use equal_var=False (Welch's t-test) if the variances differ

In [11]:
# print("T-test value is: ",t)
# print("p-value value is: ",p)
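The choice of `equal_var` can be illustrated on synthetic data. This is only a sketch; the two samples below are invented, and in practice the choice would be informed by a variance check such as Levene's test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two synthetic, independent samples of different sizes.
a = rng.normal(20, 4, 60)
b = rng.normal(22, 4, 40)

# Student's t-test assumes equal population variances.
t_student, p_student = stats.ttest_ind(a, b, equal_var=True)

# Welch's t-test drops that assumption.
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)

print(f"Student: t={t_student:.3f}, p={p_student:.4f}")
print(f"Welch:   t={t_welch:.3f}, p={p_welch:.4f}")
```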

### ANOVA

We test whether democracy score affects income per person:

Explanatory variable: democracy score category (demoscorecat)

Target variable: incomeperperson

State the hypothesis

-   $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ (the mean income per person is the same in every democracy score category)
-   $H_1:$ At least one of the means differs


In [12]:
df.columns

Index(['incomeperperson', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'internetuserate', 'lifeexpectancy', 'oilperperson', 'polityscore', 'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate', 'demoscorecat', 'co2cat', 'incomecat', 'alccat', 'electricat'], dtype='object')

In [13]:
df = df[['incomeperperson','demoscorecat']]

In [14]:
df.head()

|  | incomeperperson | demoscorecat |
|---|---|---|
| 0 | 8740.97 | 1 |
| 1 | 1915.0 | 3 |
| 2 | 2231.99 | 2 |
| 3 | 21943.34 | 2 |
| 4 | 1381.0 | 1 |


### One Way ANOVA

In [15]:
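# NOTE: this formula regresses demoscorecat on incomeperperson, i.e. it fits a simple
# linear model with the category code as a numeric response. A one-way ANOVA of income
# across categories would instead use 'incomeperperson ~ C(demoscorecat)'.
# The outputs recorded below correspond to the formula as written here.
mod = ols('demoscorecat~incomeperperson', data=df).fit()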

In [16]:
aov_table = sm.stats.anova_lm(mod,typ=2)

In [17]:
aov_table

|  | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| incomeperperson | 2.619442 | 1.0 | 2.587714 | 0.109191 |
| Residual | 213.58713 | 211.0 |  |  |


In [18]:
print(mod.summary())

                            OLS Regression Results                            
Dep. Variable:           demoscorecat   R-squared:                       0.012
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     2.588
Date:                Mon, 22 Mar 2021   Prob (F-statistic):              0.109
Time:                        18:53:11   Log-Likelihood:                -302.53
No. Observations:                 213   AIC:                             609.1
Df Residuals:                     211   BIC:                             615.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           1.9889      0.082     

In [32]:
mod.pvalues

Intercept          9.588761e-63
incomeperperson    1.091909e-01
dtype: float64

In [19]:
# f_statistic, p_value = scipy.stats.f_oneway(forty_lower, forty_fiftyseven, fiftyseven_older)
# print("F_Statistic: {0}, P-Value: {1}".format(f_statistic,p_value))

### Two-way ANOVA

In [20]:
#mod1 = ols('len~supp+dose', data=df).fit()

In [21]:
#aov1 = sm.stats.anova_lm(mod1,typ=2)

In [22]:
#aov1

### Chi-square

State the hypothesis:

-   $H_0:$ The proportion of teachers who are tenured is independent of gender
-   $H_1:$ The proportion of teachers who are tenured is associated with gender

In [23]:
#Create a Cross-tab table

# cont_table  = pd.crosstab(ratings_df['tenure'], ratings_df['gender'])
# cont_table

In [24]:
#scipy.stats.chi2_contingency(cont_table, correction = True)
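The two commented cells above can be sketched end to end on a small invented data set. The counts below are made up for illustration and stand in for `ratings_df`.

```python
import pandas as pd
from scipy import stats

# Synthetic records: 50 male and 50 female teachers with invented tenure status.
toy = pd.DataFrame({
    "tenure": ["yes"] * 30 + ["no"] * 20 + ["yes"] * 25 + ["no"] * 25,
    "gender": ["male"] * 50 + ["female"] * 50,
})

# Contingency table of observed counts.
cont_table = pd.crosstab(toy["tenure"], toy["gender"])
print(cont_table)

# Chi-square test of independence (Yates' continuity correction for 2x2 tables).
chi2, p, dof, expected = stats.chi2_contingency(cont_table, correction=True)
print(f"chi2={chi2:.3f}, p={p:.4f}, dof={dof}")
print(expected)
```

The `expected` array holds the counts that would be expected under independence, which is useful for checking that no cell is too small for the test.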

### Correlation

State the hypothesis:

-   $H_0:$ Teaching evaluation score is not correlated with beauty score
-   $H_1:$ Teaching evaluation score is correlated with beauty score


In [25]:
#scipy.stats.pearsonr(ratings_df['beauty'], ratings_df['eval'])
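A minimal sketch of the same call on synthetic data follows. The variables `x` and `y` are invented, with a linear relationship built in on purpose.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic pair of variables with a built-in positive linear relationship.
x = rng.normal(0, 1, 100)
y = 0.6 * x + rng.normal(0, 1, 100)

# Pearson correlation coefficient and the p-value for H0: no correlation.
r, p = stats.pearsonr(x, y)
print(f"r={r:.3f}, p={p:.4g}")

# The coefficient matches the off-diagonal entry of the correlation matrix.
print(np.corrcoef(x, y)[0, 1])
```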

## Regression Analysis

In [26]:
df.columns

Index(['incomeperperson', 'demoscorecat'], dtype='object')

In [27]:
y = df['incomeperperson']
X = df['demoscorecat']

In [28]:
X = sm.add_constant(X)

In [29]:
model = sm.OLS(y,X).fit()

In [30]:
model.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | incomeperperson | R-squared: | 0.012 |
| Model: | OLS | Adj. R-squared: | 0.007 |
| Method: | Least Squares | F-statistic: | 2.588 |
| Date: | Mon, 22 Mar 2021 | Prob (F-statistic): | 0.109 |
| Time: | 18:53:12 | Log-Likelihood: | -2325.6 |
| No. Observations: | 213 | AIC: | 4655.0 |
| Df Residuals: | 211 | BIC: | 4662.0 |
| Df Model: | 1 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5715.7502 | 2093.276 | 2.731 | 0.007 | 1589.338 | 9842.163 |
| demoscorecat | 1467.8156 | 912.459 | 1.609 | 0.109 | -330.888 | 3266.519 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 187.676 | Durbin-Watson: | 1.981 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 2932.661 |
| Skew: | 3.446 | Prob(JB): | 0.0 |
| Kurtosis: | 19.821 | Cond. No. | 6.05 |


#### Python code done by Dennis Lam