<br>

First, we will import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library</a>, or <i>PANDAS</i>, extensively for our data manipulations. It is invaluable for analyzing datasets. 

### Import Packages

In [None]:
import numpy as np
import pandas as pd

from pandas import DataFrame
from pandas import Series

<br>

We can check which version of various packages we're using. You can see I'm running PANDAS 0.17 here.

In [None]:
print(pd.__version__)

<br>

PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 200)

<br>The next few cells set various graphing options.

In [None]:
import matplotlib.pyplot as plt

In [None]:
#NECESSARY FOR XTICKS OPTION, ETC.
from pylab import *

In [None]:
%matplotlib inline  

In [None]:
import seaborn as sns
print(sns.__version__)

In [None]:
plt.rcParams['figure.figsize'] = (10, 7.5)

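<br>To make sure division always returns a float (in Python 2, dividing one integer by another would otherwise round the result down to a whole number)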

In [None]:
from __future__ import division

<br>I like suppressing scientific notation in my numbers. So, if you'd rather see "0.48" than "4.800000e-01", then run the following line. Note that this does not change the actual values. For outputting to CSV we'll have to run some additional code later on.

In [None]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
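To see that this option changes only how numbers are displayed, here is a minimal sketch on a made-up two-value Series (the name `s` and its values are illustrative, not from the survey data). It uses `pd.option_context` so the format applies only inside the `with` block.

```python
import pandas as pd

# Made-up values for illustration only
s = pd.Series([0.48, 1234.5678])

# Apply the float format temporarily, just inside this block
with pd.option_context('display.float_format', lambda x: '%.2f' % x):
    shown = s.to_string()

print(shown)        # the displayed text uses two decimals
print(s.iloc[1])    # the stored value keeps its full precision
```

Using `option_context` rather than `set_option` is just a convenience for experimenting; the notebook-wide setting above works the same way.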

### Read in Data

PANDAS can read in data from a variety of different file formats. You will read in a CSV file containing data from the Qualtrics survey you took last class.

In the following four lines we'll first import the CSV file and assign it to the name 'df' -- short for 'dataframe', the PANDAS name for a dataset. Second, we'll use the <i>len</i> function to see how many columns (variables) there are in the dataset, then we'll use the <i>len</i> function again to see how many rows (students) there are in the dataset; there are 43 observations in total. Finally, we will show the 'head' of the dataset -- the first 5 rows.

In [None]:
df = pd.read_csv('http://social-metrics.org/wp-content/uploads/2016/06/COM205_June-27-2016_01.18.csv')
print('# of columns: {}'.format(len(df.columns)))
print('# of observations: {}'.format(len(df)))
df.head()
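If you want to try these commands without downloading the survey file, the same steps can be sketched on a tiny in-memory CSV. The three rows below are made up for illustration, and `toy` is a hypothetical name; `shape` is simply a shortcut that returns the row and column counts together.

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for the survey file (hypothetical values)
csv_text = u"""Gender,Sleep,Happiness
Male,7,4
Female,6,3
Female,8,5
"""

toy = pd.read_csv(io.StringIO(csv_text))

# shape returns (number of rows, number of columns)
n_rows, n_cols = toy.shape
print('# of columns: {}'.format(n_cols))
print('# of observations: {}'.format(n_rows))
```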

<br>The opposite of 'head' is 'tail' -- we can use it to inspect the last few observations in the dataframe. By default 5 rows are shown; here let's specify 2 rows. 

In [None]:
df.tail(2)

#### Inspect Data Types for Columns
In PANDAS *object* indicates text columns and *int64* and *float64* indicate numerical data.

In [None]:
df.dtypes
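As a small sketch of how the data types can be put to work, the example below picks out only the numerical columns of a made-up dataframe. The name `toy` and its values are assumptions for illustration; `select_dtypes` is the PANDAS method doing the filtering.

```python
import pandas as pd

# Made-up data: one text column, one integer column, one float column
toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Female'],
                    'Sleep': [7, 6, 8],
                    'Happiness': [4.0, 3.0, 5.0]})

# Keep only the columns holding numerical data (int64, float64, ...)
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```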

#### Describe the Data
We can use the *describe* command to show the descriptive statistics (also known as *summary statistics*) for the numerical variables in the dataset.

In [None]:
df.describe().T

### Plot Data - Boxplots

In [None]:
df.boxplot('Sleep', return_type='axes')

In [None]:
df.boxplot('Happiness', return_type='axes')

### Plot Hours of Sleep by Gender

In [None]:
df.boxplot(column='Sleep', by='Gender')

### Recode Sleep

#### Variable Frequencies

Before recoding the variable, let's first take a look at how often each value occurs. To see the frequencies for the different values of a variable, use the *value_counts()* command.

In [None]:
df['Sleep'].value_counts()
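By default *value_counts()* orders the output from most to least frequent. A minimal sketch of two common variations, on made-up sleep values (the Series `sleep` is hypothetical): sorting by the value itself, and showing proportions instead of counts.

```python
import pandas as pd

# Made-up hours of sleep for seven students
sleep = pd.Series([6, 7, 6, 8, 5, 7, 6])

# Order the table by hours of sleep rather than by frequency
counts = sleep.value_counts().sort_index()

# normalize=True returns proportions that sum to 1
shares = sleep.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```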

<br>Let's also take a look at a bar graph of the amount of sleep each student got. To get a more informative plot, we'll first *sort* the dataset and then plot it. We indicate a *bar* graph in the code. 

In [None]:
df.sort_values(by=['Sleep'], ascending=False)['Sleep'].plot(kind='bar')

<br>Based on the above plot, let's set the threshold at 6 hours of sleep per night. Those who got six or fewer hours of sleep per night will be considered 'low sleep', while everyone else will be considered 'normal' (or 'not low sleep').

What we will do in the following line of code is create a new dichotomous variable (also known as a *binary variable* or a *dummy variable*) called *Low_Sleep*. NumPy's *where* function is used to assign a value of *1* to all observations where the student got 6 or fewer hours of sleep; otherwise the student is coded as *0* on the variable *Low_Sleep*.

In [None]:
df['Low_Sleep'] = np.where(df['Sleep']<=6, 1, 0)
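One thing to keep in mind with this approach: a missing value fails the `<= 6` comparison, so it is coded as *0* along with the 'normal' sleepers. The sketch below shows this on made-up data and one possible way to keep missing values missing. The names `toy` and `Low_Sleep_NA` are hypothetical, and whether the survey data contain missing values at all is not something this example assumes.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value
toy = pd.DataFrame({'Sleep': [5, 6, 7, 8, np.nan]})

# Same recode as above: NaN <= 6 is False, so the missing row becomes 0
toy['Low_Sleep'] = np.where(toy['Sleep'] <= 6, 1, 0)

# Optional variant: leave the recoded value missing when Sleep is missing
toy['Low_Sleep_NA'] = np.where(toy['Sleep'].isnull(), np.nan, toy['Low_Sleep'])

print(toy)
```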

<br>It's always a good idea to confirm that your recoding worked as expected. Let's do a *cross-tabulation* between the original and the newly recoded variables. 

In [None]:
pd.crosstab(df['Low_Sleep'], df['Sleep'])

<br>We can also do a *conditional* cross-tab to see the same result in a compressed format.

In [None]:
pd.crosstab(df['Low_Sleep'], df['Sleep']<=6)
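Besides reading the cross-tab by eye, the same check can be done programmatically. Here is a minimal sketch on made-up data (the dataframe `toy` is hypothetical): it rebuilds the condition and confirms that every row of the recoded variable agrees with it.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({'Sleep': [5, 6, 6, 7, 8]})
toy['Low_Sleep'] = np.where(toy['Sleep'] <= 6, 1, 0)

# Rebuild the condition as 0/1 and compare it row by row with the recode
expected = (toy['Sleep'] <= 6).astype(int)
recode_ok = bool((toy['Low_Sleep'] == expected).all())
print(recode_ok)

# The cross-tab shows the same thing in table form
ct = pd.crosstab(toy['Low_Sleep'], toy['Sleep'])
print(ct)
```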

<br>Let's plot *Happiness* level by the new variable *Low_Sleep*

In [None]:
df.boxplot(column='Happiness', by='Low_Sleep')

### Recode Happiness

In [None]:
df.head(2)

In [None]:
df['Happiness'].value_counts()

<br>Recall that the *Happiness* variable had 5 values. 1-2 were 'sad faces', 3 was 'neutral', and 4-5 were 'smiley faces.' Accordingly, let's create a new variable, *Happy*, where values of greater than 3 (i.e., scores of 4 or 5) are given a score of *1* on the variable, while values of 1, 2, or 3 are assigned scores of *0* on the new variable *Happy*.

In [None]:
df['Happy'] = np.where(df['Happiness']>3, 1, 0)
df.head()

<br>Once again, let's run some cross-tabulations to verify.

In [None]:
pd.crosstab(df['Happiness'], df['Happy'])

In [None]:
pd.crosstab(df['Happiness']>3, df['Happy'])

### Recode Gender

<br>Let's say you wanted to create a new variable called *Male*. You could use the same *np.where* command as above in order to help you do this. 

In [None]:
df['Male'] = np.where(df['Gender']=='Male', 1, 0)
df.head()

<br>To verify, let's run a cross-tab of *Male* with *Gender*

In [None]:
pd.crosstab(df['Male'], df['Gender'])

### Save New DataFrame

If you'd like, you could save the dataframe using Python's *pickle* serialization format, which PANDAS supports directly. It's called 'pickling' a file, so we'll give it the typical 'pkl' extension.

In [None]:
df.to_pickle('Qualtrics Survey - COM205.pkl')
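To load a pickled dataframe back in later, PANDAS provides *read_pickle*. A minimal round-trip sketch, using made-up data and a temporary folder so nothing is written next to your notebook (the file name `toy_survey.pkl` is hypothetical):

```python
import os
import tempfile
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({'Sleep': [7, 6, 8], 'Happy': [1, 0, 1]})

# Write to a temporary folder
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'toy_survey.pkl')
toy.to_pickle(path)

# Read it back and confirm the two dataframes match
restored = pd.read_pickle(path)
print(restored.equals(toy))
```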

### T-Test

<br>Let's now run some statistics. First, let's try a t-test on *Gender* and *Happiness*. Recall that a t-test is a test of the difference in *means* between two groups -- in this case, between men and women. 

Let's start by computing the means for men and for women, then print the result.

In [None]:
print("Mean Happiness Level for Men:   {}".format(df[df['Gender']=='Male']['Happiness'].mean()))
print("Mean Happiness Level for Women: {}".format(df[df['Gender']=='Female']['Happiness'].mean()))

<br>We can use a similar command, using *std( )* instead of *mean( )*, to get the standard deviations.

In [None]:
print("Standard Deviation of Happiness Level for Men:   {}".format(df[df['Gender']=='Male']['Happiness'].std()))
print("Standard Deviation of Happiness Level for Women: {}".format(df[df['Gender']=='Female']['Happiness'].std()))

<br>We can also use the *len* function to get the number of observations for men and women.

In [None]:
print("Number of Observations for Men:   {}".format(len(df[df['Gender']=='Male'])))
print("Number of Observations for Women: {}".format(len(df[df['Gender']=='Female'])))
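The three summaries above (mean, standard deviation, and number of observations) can also be produced in one step with *groupby*. This is a sketch on made-up data rather than the survey file; the dataframe `toy` and its values are hypothetical.

```python
import pandas as pd

# Made-up happiness scores for three men and three women
toy = pd.DataFrame({'Gender': ['Male'] * 3 + ['Female'] * 3,
                    'Happiness': [3, 4, 5, 2, 4, 3]})

# One row per group, one column per statistic
summary = toy.groupby('Gender')['Happiness'].agg(['mean', 'std', 'count'])
print(summary)
```

Note that *count* ignores missing values, whereas *len* counts every row in the group.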

### Running the t-test: Manually Calculated Approach

In [None]:
std_male = df[df['Gender']=='Male']['Happiness'].std()
std_female = df[df['Gender']=='Female']['Happiness'].std()
print('{} {}'.format(std_male, std_female))

In [None]:
mean_male = df[df['Gender']=='Male']['Happiness'].mean()
mean_female = df[df['Gender']=='Female']['Happiness'].mean()
print('{} {}'.format(mean_male, mean_female))

In [None]:
n_male = len(df[df['Gender']=='Male'])
n_female = len(df[df['Gender']=='Female'])
print('{} {}'.format(n_male, n_female))

In [None]:
t_numerator = mean_male - mean_female
t_numerator

In [None]:
print(n_male*std_male**2)
print(n_male*(std_male**2))

In [None]:
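# .std() returns the sample standard deviation (ddof=1), so each group's
# variance is weighted by n - 1 when pooling
t_denominator = np.sqrt(
                    ( ((n_male - 1)*std_male**2 + (n_female - 1)*std_female**2) /(n_male + n_female - 2) )* 
                    ( (n_male + n_female)/(n_male*n_female) )
                    ) 
t_denominator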

In [None]:
t_numerator/t_denominator
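To make the steps of the manual calculation easier to follow, here is a self-contained sketch on made-up scores (the arrays `male` and `female` are hypothetical). It assumes the equal-variance (pooled) form of the t-test and sample variances computed with `ddof=1`, which is what PANDAS' *std( )* uses by default.

```python
import numpy as np

# Made-up happiness scores for two small groups
male = np.array([3.0, 4.0, 5.0])
female = np.array([2.0, 4.0, 3.0])

n1, n2 = len(male), len(female)

# Sample variances (ddof=1 divides by n - 1)
var1 = male.var(ddof=1)
var2 = female.var(ddof=1)

# Pooled variance: each group's variance weighted by its degrees of freedom
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

# Standard error of the difference in means
std_err = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))

t_stat = (male.mean() - female.mean()) / std_err
dof = n1 + n2 - 2

print('t-stat: {:.4f}'.format(t_stat))
print('d.f.: {}'.format(dof))
```

Here `1/n1 + 1/n2` is the same quantity as `(n1 + n2)/(n1*n2)`, just written differently.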

## T-tests and Chi-Square Tests Using Statistical Packages

A much easier way to do this, however, is to use pre-programmed statistical packages.

### T-Test

For the t-test we'll use the *statsmodels* package. The t-test will return three values for us, as shown in the following block of code. 

In [None]:
import statsmodels.api as sm

'''
Returns
-------
tstat : float
    test statistic   
    --> "This is the t-statistic."
    --> "It is the ratio of the mean of the difference to the standard error of the difference..."
pvalue : float
    pvalue of the t-test
df : int or float
    degrees of freedom used in the t-test
'''

#### t-test for Gender and Sleep

Let's first run a t-test on *Gender* and *Sleep*. Is there a statistically significant difference?

In [None]:
result = sm.stats.ttest_ind(df[df['Gender']=='Male']['Sleep'], 
          df[df['Gender']=='Female']['Sleep'])
print('{}\n'.format(result))
print('t-stat: {}'.format(result[0]))
print('p-value: {}'.format(result[1]))
print('d.f.: {}'.format(result[2]))

#### t-test for Gender and Happiness

Now let's run a t-test on *Gender* and *Happiness*. Is there a statistically significant difference here?

In [None]:
result = sm.stats.ttest_ind(df[df['Gender']=='Male']['Happiness'], 
          df[df['Gender']=='Female']['Happiness'])
print('{}\n'.format(result))
print('t-stat: {}'.format(result[0]))
print('p-value: {}'.format(result[1]))
print('d.f.: {}'.format(result[2]))

### Chi-Square Test of Sleep and Happiness

We can try one more test. Let's look at the relationship between *Low_Sleep* and *Happy*. We'll use the *scipy* package for this. First we import the relevant part of the package.

In [None]:
import scipy.stats as scs

<br>The question mark after a command is used to open a *help* dialogue box for the given command. We can try it now.

In [None]:
scs.chi2_contingency?
'''
Returns
-------
chi2 : float
    The test statistic.
p : float
    The p-value of the test
dof : int
    Degrees of freedom
expected : ndarray, same shape as `observed`
    The expected frequencies, based on the marginal sums of the table.
'''

<br>Before running the Chi-squared command, let's take a look at the cross-tab of *Low_Sleep* and *Happy* -- recall that the cross-tabulated data forms the basis for the chi-squared test.

In [None]:
pd.crosstab(df['Low_Sleep'], df['Happy'])
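To show how the cross-tab feeds into the test, here is a sketch that computes the expected frequencies and the chi-squared statistic by hand for a made-up 2x2 table (the counts in `observed` are hypothetical, not the survey results). Note that this is the uncorrected statistic; SciPy's `chi2_contingency` applies Yates' continuity correction by default for 2x2 tables, so its value will be somewhat smaller unless you pass `correction=False`.

```python
import numpy as np

# Made-up 2x2 table of counts: rows = Low_Sleep (0/1), columns = Happy (0/1)
observed = np.array([[10.0, 20.0],
                     [20.0, 10.0]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected count in each cell if the two variables were unrelated
expected = np.outer(row_totals, col_totals) / n

# Sum of (observed - expected)^2 / expected over all cells
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print('chi2: {:.4f}'.format(chi2))
print('dof: {}'.format(dof))
```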

<br>Now we're ready to run the chi-squared test. Is there a statistically significant relationship between the two variables?

In [None]:
result = scs.chi2_contingency(pd.crosstab(df['Low_Sleep'], df['Happy']))
print('chi2: {}'.format(result[0]))
print('p: {}'.format(result[1]))
print('# of obs: {}'.format(result[3].sum()))
print('dof: {}'.format(result[2]))

<br>

For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter <a href='https://twitter.com/gregorysaxton'>@gregorysaxton</a>