# Gender vs Phobias


### Author : Eric H. Lewis

Last update: 6/11/2021

## Table of Contents 

* Objective / Use-Cases
* Data / Data Cleaning
* Exploratory Data Analysis
    * Hypothesis Testing
* Results
* Conclusions

## Objective 

---

Is there power in key demographics? Specifically, is a person's gender related to the phobias they report?

_ |_
:-------------------------:|:-------------------------:
![Imgur](https://i.imgur.com/sshvZGd.png)| ![Imgur](https://i.imgur.com/K4AabGL.png)

## Data & Data Cleaning 

This analysis uses the [Young People Survey](https://www.kaggle.com/miroslavsabo/young-people-survey?select=responses.csv) dataset. 
Students of the Statistics class at FSEV UK in Slovakia asked people they knew to participate in the survey.

* Over 1000 survey participants.
* Responses were collected in both electronic and written form.
* All participants were of Slovakian nationality and aged between 15 and 30.





## Feature Selection

Demographics of Interest

* Gender

Category
* Phobias

Demographics not Chosen
* Age

#### Demographics that were not chosen:
_ |_
:-------------------------:|:-------------------------:
![Imgur](https://i.imgur.com/O33bAmX.png) | ![Imgur](https://i.imgur.com/9QKhE8B.png)


In [None]:
## TODO subplots


## Histogram
plt.hist(demographics['Age'], bins = 16, align='right', color='b', edgecolor='black',
              linewidth=1)

plt.xlim(14, 31)
plt.ylim(0, 225)

plt.xlabel("Years Old")
plt.ylabel("Frequency")
plt.title("Age Distribution of Survey Participants")
# Save before plt.show(); calling savefig afterwards typically writes an empty figure
plt.savefig('img/age_dist.jpeg')
plt.show()


# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = ['15 yr', '16 yr', '17 yr', '18 yr', '19 yr', '20 yr', '21 yr', '22 yr', '23 yr', '24 yr', '25 yr', '26 yr', '27 yr', '28 yr', '29 yr', '30 yr']
sizes = np.array([demographics['Age'].value_counts(normalize=True).sort_index()]).flatten()
explode = (0.1, 0.1, 0.1, 0, 0, 0, 0, 0.1, 0.1, 0.1, 0.1 ,0.1 ,0.1 ,0.1 ,0.1 ,0.1) 
# "explode" (offset) every slice except ages 18-21

fig1, ax1 = plt.subplots(figsize = (7, 9.5))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle= 0)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

# ax1.title('Age')
ax1.legend(loc = "lower center", framealpha=1, ncol = 8, borderpad =1, shadow=True)

plt.title('Age of Survey Participants', loc='center', y = .90)
# Save before plt.show(); calling savefig afterwards typically writes an empty figure
plt.savefig('img/age_pie.png')
plt.show()


sns.set_theme(style="darkgrid")
ax = sns.countplot(x='Gender', data=phobias1)


###### Data Sample

![Imgur](https://i.imgur.com/8zgInip.png)



![Imgur](https://i.imgur.com/0w8HGA2.png)

## Exploratory Data Analysis
### Hypothesis Testing



###### Null Hypothesis:
    
$H_0$ : Phobias are independent of a person's Gender.
    

###### Alternate Hypothesis:


$H_1$ : Phobias and Gender are not independent and there exists a relationship between them.




## $\chi^2$ test of independence 

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

$$\alpha = 0.05 $$ 
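As a rough illustration of the formula above, the sketch below computes the expected counts, the $\chi^2$ statistic, and the degrees of freedom for a small contingency table in plain Python. The counts are made up for illustration and are not taken from the survey, and the helper name `chi2_statistic` is hypothetical. The notebook itself relies on `scipy.stats.chi2_contingency` for this step.

```python
import math


def chi2_statistic(observed):
    """Return (chi2, dof, expected) for a 2-D table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    # Sum of (O - E)^2 / E over every cell of the table
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected


# Hypothetical 2 x 3 table (rows: two groups, columns: three answer levels)
observed = [[10, 20, 30],
            [20, 20, 20]]
chi2, dof, expected = chi2_statistic(observed)

# With exactly 2 degrees of freedom the chi-squared tail probability is exp(-chi2 / 2)
p_value = math.exp(-chi2 / 2) if dof == 2 else None

print(round(chi2, 4), dof, p_value)
```

For this toy table the p-value is roughly 0.069, which is above $\alpha = 0.05$, so independence would not be rejected.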


## Bonferroni Corrected p-value

$$Bonferroni \ corrected \ p-value = \min(1, \ n \cdot p_0)$$

$$Bonferroni \ corrected \ \alpha = \frac{\alpha}{n}$$

$$where: \ \ \ \ p_0 = original \ p-value$$ 
$$n = \# \ of \ tests \ performed$$ 
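The Bonferroni method can be applied in two equivalent ways: multiply each p-value by the number of tests (capping at 1) and compare it with the original $\alpha$, or leave the p-values alone and compare them with $\alpha / n$. The sketch below shows both in plain Python. The function name `bonferroni` and its `n_tests` parameter are hypothetical; the two p-values are the Flying and Heights values from the Results table, and $n = 11$ is inferred from the reported corrected alpha ($0.05 / 11 \approx 0.004545$). The notebook itself uses `multipletests` from statsmodels.

```python
def bonferroni(p_values, alpha=0.05, n_tests=None):
    """Return (corrected p-values, corrected alpha, reject flags)."""
    # n_tests lets the caller correct for more tests than are listed here
    n = len(p_values) if n_tests is None else n_tests

    # Corrected p-value: original p-value times the number of tests, capped at 1
    corrected = [min(1.0, p * n) for p in p_values]

    # Equivalent view: shrink the significance level instead of inflating p
    alpha_corrected = alpha / n

    # Reject the null hypothesis when the original p-value is within the corrected alpha
    reject = [p <= alpha_corrected for p in p_values]
    return corrected, alpha_corrected, reject


# Original p-values for Flying and Heights, as reported in the Results table
p_values = [7.208e-04, 0.0989]
corrected, alpha_corrected, reject = bonferroni(p_values, alpha=0.05, n_tests=11)

print(corrected, alpha_corrected, reject)
```

Under these assumptions the Flying test is still rejected after correction, while Heights is not, and its corrected p-value is capped at 1.0.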

_ |_
:-------------------------:|:-------------------------:
![Imgur](https://i.imgur.com/3skCNll.png) | ![Imgur](https://i.imgur.com/55i3dFx.png)

## Cross Tabulation 
_ |_
:-------------------------:|:-------------------------:
![Imgur](https://i.imgur.com/IfMXRxj.png) |![Imgur](https://i.imgur.com/eqLbPEA.png)

<!-- <img src="img/gendervsflying.png" width="350" align="center"/>
<img src="img/Fear2.png" width="500" align="center"/> -->
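A cross tabulation counts how often each combination of two categorical values occurs, which gives the observed-frequency table that the $\chi^2$ test works on. Below is a minimal sketch with `pd.crosstab`, using a handful of made-up rows rather than survey responses (the `toy` frame and its values are purely illustrative):

```python
import pandas as pd

# Hypothetical responses, NOT taken from the survey:
# one row per participant, with gender and a Likert answer (1-5) for "Flying"
toy = pd.DataFrame({
    "Gender": ["female", "male", "female", "male", "female", "female"],
    "Flying": [1, 1, 3, 2, 1, 3],
})

# Rows are the genders, columns are the Likert values, cells are counts
table = pd.crosstab(toy["Gender"], toy["Flying"])
print(table)
```

Each cell is an observed count $O_i$; combinations that never occur, such as a male participant answering 3 in this toy data, appear as 0.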

## crosstab_chi2

![Imgur](https://i.imgur.com/xFHZL3j.png)

## Crosstab Output

![Imgur](https://i.imgur.com/bNOgM4U.png)


## bonferroni_adjustment

![Imgur](https://i.imgur.com/TfilGMe.png)

### Results

|  Original Alpha | Corrected Alpha for Bonferroni method |
| :---: | :---: |
| 0.05 | 0.004545454545454546 |

| Gender vs Phobia | Original p-val | Bonferroni Corrected p-val| Reject Null? |
| :---: | :---: | :---: | :---: |
| Flying             | 7.208e-04  | 7.93e-03 | True |
| Thunder, Lightning | 4.6241e-22 | 5.09e-21 | True |
| Darkness           | 9.9074e-23 | 1.09e-21 | True |
| Heights            | 0.0989     | 1.0      | False|
| Spiders            | 6.432e-24  | 7.08e-23 | True |
| Snakes             | 1.531e-11  | 1.68e-10 | True |
| Rats, Mice         | 2.724e-15  | 2.99e-14 | True |
| Ageing             | 1.289e-05  | 1.42e-04 | True |
| Dangerous Dogs     | 6.972e-10  | 7.67e-09 | True |
| Public Speaking    | 2.502e-04  | 2.75e-03 | True |


## Conclusions

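There is a relationship between the Gender demographic and a person's phobias. For 9 of the 10 phobias tested, the null hypothesis of independence from Gender is rejected at $\alpha = 0.05$ after the Bonferroni correction. Fear of heights is the only phobia that shows no statistically significant association with Gender.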

* Future-Steps 

# The End

## Code

In [1]:
import pandas as pd
from scipy import stats

import numpy as np
import seaborn as sns
import statsmodels as sm
import statsmodels.stats.multitest  # load the submodule so sm.stats.multitest resolves

import matplotlib.pyplot as plt
import csv

from IPython.display import display, Math
from mpl_toolkits.mplot3d import Axes3D

In [2]:
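# Build a lookup from the second field of each row of columns.csv to the first field
with open('archive/columns.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    mydict = {rows[1]:rows[0] for rows in reader}

# Save the lookup as key,value rows
with open('archive/columns_dict.csv', mode='w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(mydict.items())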

In [3]:
responses = pd.read_csv('archive/responses.csv')

responses.describe()

Unnamed: 0,Music,Slow songs or fast songs,Dance,Folk,Country,Classical music,Musical,Pop,Rock,Metal or Hardrock,...,Shopping centres,Branded clothing,Entertainment spending,Spending on looks,Spending on gadgets,Spending on healthy eating,Age,Height,Weight,Number of siblings
count,1007.0,1008.0,1006.0,1005.0,1005.0,1003.0,1008.0,1007.0,1004.0,1007.0,...,1008.0,1008.0,1007.0,1007.0,1010.0,1008.0,1003.0,990.0,990.0,1004.0
mean,4.731877,3.328373,3.11332,2.288557,2.123383,2.956132,2.761905,3.471698,3.761952,2.36147,...,3.234127,3.050595,3.201589,3.106256,2.870297,3.55754,20.433699,173.514141,66.405051,1.297809
std,0.664049,0.833931,1.170568,1.138916,1.076136,1.25257,1.260845,1.1614,1.184861,1.372995,...,1.323062,1.306321,1.188947,1.205368,1.28497,1.09375,2.82884,10.024505,13.839561,1.013348
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,15.0,62.0,41.0,0.0
25%,5.0,3.0,2.0,1.0,1.0,2.0,2.0,3.0,3.0,1.0,...,2.0,2.0,2.0,2.0,2.0,3.0,19.0,167.0,55.0,1.0
50%,5.0,3.0,3.0,2.0,2.0,3.0,3.0,4.0,4.0,2.0,...,3.0,3.0,3.0,3.0,3.0,4.0,20.0,173.0,64.0,1.0
75%,5.0,4.0,4.0,3.0,3.0,4.0,4.0,4.0,5.0,3.0,...,4.0,4.0,4.0,4.0,4.0,4.0,22.0,180.0,75.0,2.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,30.0,203.0,165.0,10.0


## Custom Functions

In [None]:
def crosstab_chi2(index, columns, alpha_signif):
    ''' Cross-tabulate two columns and run a chi-squared test of independence.
        Returns the chi-squared test statistic, associated p-value,
        degrees of freedom, and the expected frequency array.
        
    Parameters
    -----------
    index : Column 1 of dataframe that is not subject to change. i.e : Gender
    
    columns : Column 2 of dataframe that you want to check if dependent on index.
    
    alpha_signif : The chosen level of significance. Kept for reference only;
    it is not used in the computation.
    
    Returns
    --------
    Tuple(chi2_test_stat, p_val, dof, expec_freq_arr)
    
    '''
    
    crosstab = pd.crosstab(index, columns)
    #display(crosstab)
    
    # Expected frequencies for the n-dimensional contingency table based on the 
    # marginal sums under the assumption that the groups associated with each 
    # dimension are independent.
    
    (chi2_test_stat, p_val, dof, expec_freq_arr) = stats.chi2_contingency(crosstab)

    return (chi2_test_stat, p_val, dof, expec_freq_arr) 


def bonferroni_adjustment(dataframe, alpha_sig):
    ''' Run a chi-squared test of independence for every column of the dataframe
    against its last column, then apply the Bonferroni correction for multiple tests.
    
    Parameters
    -----------
    dataframe : A data frame whose last column is the grouping variable (i.e. Gender)
    and whose other columns are the variables to test against it.
    
    alpha_sig : The chosen level of significance.
    
    Returns
    --------
    reject_arr : an array of bool for the hypotheses that can be rejected given alpha
    
    pval_corrected : an array of p-values corrected for multiple tests
    
    alphaC_bonf : corrected alpha for Bonferroni method
    '''
    
    p_val_arr = []
    
    # Note: the loop also visits the last column itself (e.g. Gender vs Gender),
    # so the number of tests equals the total number of columns.
    for column in dataframe:
        (chi2_test_stat, p_val, dof, expec_freq_arr) = crosstab_chi2(dataframe.iloc[:, -1], dataframe[column], alpha_sig)
        p_val_arr.append(p_val)
    
    (reject_arr, pval_corrected, alphaC_sidak, 
     alphaC_bonf) = sm.stats.multitest.multipletests(p_val_arr,
                                                     alpha=alpha_sig, 
                                                     method='bonferroni', 
                                                     is_sorted=False, 
                                                     returnsorted=False)
    print(reject_arr, pval_corrected, alphaC_bonf)
    return reject_arr, pval_corrected, alphaC_bonf
 

## Getting the Dataframes

In [4]:
# The demographic columns start at position 140; slicing keeps the original dtypes
demographics = responses.iloc[:, 140:].copy()
demographics.sample(2)

Unnamed: 0,Age,Height,Weight,Number of siblings,Gender,Left - right handed,Education,Only child,Village - town,House - block of flats
145,19.0,165.0,78.0,1.0,female,right handed,secondary school,no,village,house/bungalow
274,22.0,170.0,60.0,0.0,female,right handed,secondary school,yes,city,house/bungalow


In [5]:
# The first 140 columns hold the preference and opinion answers
preferences = responses.iloc[:, :140].copy()
preferences = preferences.rename(columns = mydict)
# preferences.sample(2)

In [6]:
# Columns 63 to 72 of the preferences frame are the ten phobia questions
phobias = preferences.iloc[:, 63:73].copy()
phobias1 = phobias.copy()
phobias1['Gender'] = demographics['Gender']
# phobias1.sample(13)

In [None]:
for column in phobias1:
    print(crosstab_chi2(phobias1.iloc[:, -1], phobias1[column], 0.05))

## Bon-Correction

In [None]:
bonferroni_adjustment(phobias1, 0.05)

In [None]:
male_phob1 = phobias.copy()
male_phob1['Gender'] = demographics['Gender']
male_phob1 = male_phob1.loc[male_phob1['Gender'] == 'male'].copy()
arr_m = np.array([male_phob1['Flying'].value_counts().sort_index()])
female_phob1 = phobias.copy()
female_phob1['Gender'] = demographics['Gender']
female_phob1 = female_phob1.loc[female_phob1['Gender'] == 'female'].copy()
arr_f = np.array([female_phob1['Flying'].value_counts().sort_index()])

In [None]:
#1 - 2 - 7
fig = plt.figure(figsize = (9, 10))
ax = fig.add_subplot(111, projection ="3d")


x = np.array([1,2])

y = np.array([1,2, 3, 4, 5])              # correct

xpos, ypos = np.meshgrid(x, y)

z = np.array([arr_m.flatten(), arr_f.flatten()])

xpos = xpos.flatten()
ypos = ypos.flatten()

# colors = ['b', 'g'] *3
zpos = np.zeros_like(xpos)

dx = 0.5*np.ones_like(zpos)
dy = dx.copy()
dz = np.array([241, 222, 132, 73, 133, 74, 50, 20, 36, 20])


ax.set_xticks([2.25, 1.25])
ax.set_xticklabels(['Male', 'Female'])

ax.set_yticks([1, 2, 3, 4, 5])
ax.set_yticklabels(['1.0', '2.0', '3.0', '4.0', '5.0'])

# ax.set_xlim()
# ax.set_ylim()

ax.set_zlim(0,250)

ax.view_init(azim = 45)


ax.set_title("Fear of Flying by Gender", fontsize =20)
ax.set_xlabel("Gender", labelpad =10)
ax.set_ylabel("Likert Value", labelpad =10)
ax.set_zlabel("Count", labelpad = 10)

ax.bar3d(xpos, ypos, zpos, dx, dy, dz)
# plt.tight_layout()
# Save before plt.show(); calling savefig afterwards typically writes an empty figure
plt.savefig("img/FearvsGender.png")
plt.show()

In [None]:
!pip install jupyterthemes

from jupyterthemes import get_themes
import jupyterthemes as jt
from jupyterthemes.stylefx import set_nb_theme

set_nb_theme('chesterish')