## <center> Exploring Narcissistic Personality Data

## About Data
We’ll explore a sample of data (`npi_sample`) from the Narcissistic Personality Inventory, a personality test with 40 questions about personal preferences and self-view.<br>
Each question has two possible responses. The sample we’ll be working with contains responses to the following five questions:
- **influence:** `yes` = I have a natural talent for influencing people; `no` = I am not good at influencing people.
- **blend_in:** `yes` = I prefer to blend in with the crowd; `no` = I like to be the center of attention.
- **special:** `yes` = I think I am a special person; `no` = I am no better or worse than most people.
- **leader:** `yes` = I see myself as a good leader; `no` = I am not sure if I would make a good leader.
- **authority:** `yes` = I like to have authority over other people; `no` = I don’t mind following orders.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

## Introduction

##### Exercise 1
Load the dataset and print the first five rows of the dataframe using the `.head()` method. <br>
Which of these variables do you think might be associated?

In [2]:
npi = pd.read_csv("npi_sample.csv")
npi.head()

Unnamed: 0,influence,blend_in,special,leader,authority
0,no,yes,yes,yes,yes
1,no,yes,no,no,no
2,yes,no,yes,yes,yes
3,yes,no,no,yes,yes
4,yes,yes,no,yes,no


## Contingency Tables: Frequencies
Contingency tables, also known as two-way tables or cross-tabulations, are useful for summarizing two variables at the same time.<br>
For example, suppose we are interested in understanding whether there is an association between `influence` and `leader`. <br>
We can use the `crosstab` function from pandas to create a contingency table.
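As a minimal sketch of how `pd.crosstab` counts pairs of responses, the snippet below uses a small made-up dataframe. The `toy` data and the name `toy_freq` are hypothetical and are not taken from `npi_sample.csv`.

```python
import pandas as pd

# Hypothetical toy responses, invented for illustration only
toy = pd.DataFrame({
    "influence": ["yes", "yes", "no", "no", "yes", "no"],
    "leader":    ["yes", "no",  "no", "no", "yes", "yes"],
})

# Rows are the categories of the first argument,
# columns are the categories of the second argument
toy_freq = pd.crosstab(toy["influence"], toy["leader"])
print(toy_freq)
```

Each cell counts the respondents who gave that particular combination of answers, so the cells add up to the number of rows in the dataframe.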

##### Exercise 2
Do you think there will be an association between `special` and `authority`?<br>
Create a contingency table for these two variables and store the table as `special_authority_freq`, then print out the result.

Based on this table, do you think the variables are associated?

In [7]:
special_authority_freq = pd.crosstab(npi["special"], npi["authority"])
special_authority_freq

authority,no,yes
special,,
no,4069,1905
yes,2229,2894


In [5]:
# if we know how someone responded to the authority question, 
# we have some information about how they are likely to respond to the special question. 
# This suggests that the variables are associated.

##### Exercise 3
Do you think there will be an association between `influence` and `leader`?<br>
Create a contingency table for these two variables and store the table as `influence_leader_freq`, then print out the result.

Based on this table, do you think the variables are associated?

In [16]:
influence_leader_freq = pd.crosstab(npi["influence"], npi["leader"])
influence_leader_freq

leader,no,yes
influence,,
no,3015,1293
yes,2360,4429


In [17]:
# if we know how someone responded to the leadership question, 
# we have some information about how they are likely to respond to the influence question. 
# This suggests that the variables are associated.

## Contingency Tables: Proportions

##### Exercise 4
Convert `special_authority_freq` to a table of proportions and save the result as `special_authority_prop`, then print it out.

In [15]:
special_authority_prop = special_authority_freq / len(npi)
special_authority_prop

authority,no,yes
special,,
no,0.366676,0.171668
yes,0.200865,0.260791



##### Exercise 5
Convert `influence_leader_freq` to a table of proportions and save the result as `influence_leader_prop`, then print it out.

In [14]:
influence_leader_prop = influence_leader_freq / len(npi)
influence_leader_prop

leader,no,yes
influence,,
no,0.271695,0.116518
yes,0.21267,0.399117
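Dividing by `len(npi)` is one way to get proportions. As an alternative sketch, `pd.crosstab` also accepts `normalize=True`, which divides every cell by the grand total. Because the CSV file is not available here, the snippet rebuilds row-level data from the `influence`/`leader` counts shown earlier; the names `counts`, `rebuilt`, and `prop` are assumptions made for this illustration.

```python
import pandas as pd

# Counts copied from the influence/leader frequency table;
# keys are (influence, leader) response pairs
counts = {
    ("no", "no"): 3015,
    ("no", "yes"): 1293,
    ("yes", "no"): 2360,
    ("yes", "yes"): 4429,
}

# Rebuild one row per respondent so that crosstab has raw data to count
rows = [pair for pair, n in counts.items() for _ in range(n)]
rebuilt = pd.DataFrame(rows, columns=["influence", "leader"])

# normalize=True divides each cell by the total number of respondents
prop = pd.crosstab(rebuilt["influence"], rebuilt["leader"], normalize=True)
print(prop)
```

The result should agree with the table obtained by dividing the frequency table by the number of rows.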


## Marginal Proportions
The proportion of respondents in each category of a single question is called a marginal proportion.
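A marginal proportion can be read off a table of proportions by summing across the other variable. The sketch below rebuilds the `special`/`authority` frequency table from the counts shown earlier and sums along each axis; the names `freq`, `prop`, `column_marginals`, and `row_marginals` are chosen for this illustration only.

```python
import pandas as pd

# Observed counts copied from the special/authority frequency table
freq = pd.DataFrame(
    [[4069, 1905], [2229, 2894]],
    index=pd.Index(["no", "yes"], name="special"),
    columns=pd.Index(["no", "yes"], name="authority"),
)

# Convert counts to proportions of all respondents
prop = freq / freq.to_numpy().sum()

# axis=0 sums down the rows: one marginal per `authority` response
column_marginals = prop.sum(axis=0)

# axis=1 sums across the columns: one marginal per `special` response
row_marginals = prop.sum(axis=1)

print(column_marginals)
print(row_marginals)
```

Each set of marginals sums to 1, because every respondent falls into exactly one category of each question.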

##### Exercise 6
Calculate the marginal proportions for the `authority` variable and save the result as `authority_marginals`.<br>
Print out `authority_marginals`. Do more people like to have authority over people or not?

In [27]:
# Sum down the rows to get one marginal per `authority` response, then select `yes` by label
authority_marginals = special_authority_prop.sum(axis=0)["yes"]
authority_marginals

0.4324592232134812

In [28]:
# No, only about 43% of respondents like to have authority over people.

##### Exercise 7
Calculate the marginal proportions for the `special` variable and save the result as `special_marginals`.<br>
Print out `special_marginals`. Do more people see themselves as special or not special?

In [35]:
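# Sum across the columns to get one marginal per `special` response, then select `yes` by label
special_marginals = special_authority_prop.sum(axis=1)["yes"]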
special_marginals

0.46165630350545195

In [None]:
# No, only about 46% of respondents see themselves as special.

##### Exercise 8
Calculate the marginal proportions for the `influence` variable and save the result as `influence_marginals`.<br>
Print out `influence_marginals`. Do more people think they have a talent for influencing people or not?

In [48]:
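# Sum across the columns to get one marginal per `influence` response, then select `yes` by label
influence_marginals = influence_leader_prop.sum(axis=1)["yes"]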
influence_marginals

0.611786969451203

In [49]:
# Yes, about 61% of respondents think they have a talent for influencing people.

##### Exercise 9
Calculate the marginal proportions for the `leader` variable and save the result as `leader_marginals`.<br>
Print out `leader_marginals`. Do more people see themselves as a leader or not?

In [52]:
leader_marginals = influence_leader_prop.sum(axis=0)["yes"]
leader_marginals

0.5156348562674596

In [53]:
# Respondents are approximately evenly split: about 52% see themselves as a leader.

## Expected Contingency Tables
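An expected contingency table shows the frequencies we would expect to see if there were no association between the two variables. The more the expected and observed tables differ, the more confident we can be that the variables are associated.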
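To make the idea of "expected if there were no association" concrete, here is a small sketch that builds the expected table by hand from the row and column totals and checks it against SciPy. It uses the observed `special`/`authority` counts from earlier; the names `observed`, `expected_manual`, and `expected_scipy` are assumptions made for this example.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the special/authority frequency table
observed = np.array([[4069, 1905],
                     [2229, 2894]])

n = observed.sum()                   # total number of respondents
row_totals = observed.sum(axis=1)    # totals for special = no / yes
col_totals = observed.sum(axis=0)    # totals for authority = no / yes

# Under no association, each expected cell is
# (row total * column total) / grand total
expected_manual = np.outer(row_totals, col_totals) / n

# chi2_contingency returns the same table as its fourth output
_, _, _, expected_scipy = stats.chi2_contingency(observed)

print(np.round(expected_manual))
print(np.allclose(expected_manual, expected_scipy))
```

The expected table keeps the same marginal totals as the observed table; only the way the counts are distributed across the cells changes.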

#####  Exercise 10
Use the `chi2_contingency()` function to calculate the expected frequency table for the `special` and `authority` questions if there were no association. Save the result as `expected`.

In [54]:
special_authority_freq

authority,no,yes
special,,
no,4069,1905
yes,2229,2894


In [55]:
chi2, pval, dof, expected = stats.chi2_contingency(special_authority_freq)

In [57]:
expected

array([[3390.48860052, 2583.51139948],
       [2907.51139948, 2215.48860052]])

##### Exercise 11
Use ``np.round()`` to print out the expected contingency table, with values rounded to the nearest whole number. <br>
Compare this to the observed frequency table. How much do the numbers in these tables differ?

In [58]:
np.round(expected)

array([[3390., 2584.],
       [2908., 2215.]])

In [59]:
print("expected contingency table (no association):")
print(np.round(expected))

expected contingency table (no association):
[[3390. 2584.]
 [2908. 2215.]]


##### Exercise 12
Use the `chi2_contingency()` function to calculate the expected frequency table for the `influence` and `leader` questions if there were no association. Save the result as `expected`.

In [60]:
influence_leader_freq

leader,no,yes
influence,,
no,3015,1293
yes,2360,4429


In [62]:
chi2, pval, dof, expected = stats.chi2_contingency(influence_leader_freq)
expected

array([[2086.6450392, 2221.3549608],
       [3288.3549608, 3500.6450392]])

##### Exercise 13
Use ``np.round()`` to print out the expected contingency table, with values rounded to the nearest whole number. <br>
Compare this to the observed frequency table. How much do the numbers in these tables differ?

In [63]:
np.round(expected)

array([[2087., 2221.],
       [3288., 3501.]])

In [64]:
# We see some pretty big differences (e.g., 3015 in the observed table compared to 2087 in the expected table).

## The Chi-Square Statistic
In the previous exercise, we calculated a contingency table of expected frequencies if there were no association between the `leader` and `influence` questions.<br>

We then compared this to the observed contingency table. Because the tables looked somewhat different, we concluded that responses to these questions are probably associated.

While we can inspect these tables visually, many data scientists use the Chi-Square statistic to summarize how different these two tables are.

The Chi-Square statistic is the first output of the SciPy function `chi2_contingency()`.

The interpretation of the Chi-Square statistic depends on the size of the contingency table. For a 2x2 table, a Chi-Square statistic larger than around 4 would strongly suggest an association between the variables (with one degree of freedom, the 5% critical value is about 3.84).
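As a rough sketch of what the statistic measures, the snippet below computes it by hand as the sum of (observed - expected)² / expected over all cells, using the `influence`/`leader` counts from earlier. One detail to be aware of: for 2x2 tables, `chi2_contingency()` applies Yates' continuity correction by default, which moves each observed count 0.5 closer to its expected value before squaring. The variable names below are assumptions made for this example.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the influence/leader frequency table
observed = np.array([[3015, 1293],
                     [2360, 4429]])

# Without the continuity correction, SciPy uses the plain formula
chi2_plain, _, _, expected = stats.chi2_contingency(observed, correction=False)
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Default call: Yates' correction is applied because the table is 2x2
chi2_default, _, _, _ = stats.chi2_contingency(observed)
chi2_yates_manual = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

print(chi2_plain, chi2_manual)
print(chi2_default, chi2_yates_manual)
```

With counts this large, the corrected and uncorrected statistics are close to each other, and both are far above the rough threshold of 4.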

##### Exercise 14
Use the `chi2_contingency()` function to calculate the Chi-Square statistic for the `special` and `authority` variables.<br>
Save the result as `chi2` and print it out. Do these variables appear to be associated?

In [67]:
chi2, pval, dof, expected = stats.chi2_contingency(special_authority_freq)

In [68]:
chi2
# Yes, the `special` and `authority` variables appear to be associated: the statistic is far larger than 4.

679.1219526170606

##### Exercise 15
Use the `chi2_contingency()` function to calculate the Chi-Square statistic for the `influence` and `leader` variables.<br>
Save the result as `chi2` and print it out. Do these variables appear to be associated?


In [72]:
chi2, pval, dof, expected = stats.chi2_contingency(influence_leader_freq)

In [73]:
chi2
# Yes, the `influence` and `leader` variables appear to be associated: the statistic is far larger than 4.

1307.8836807573769