# Set up IPython, Import Data, Recode Data, etc.

<br>
First, we will import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations. It is invaluable for analyzing datasets. 

### Import Packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pandas import DataFrame
from pandas import Series

<br>

We can check which version of various packages we're using. You can see I'm running PANDAS 0.17 here.

In [None]:
print(pd.__version__)

<br>

PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns, so I typically include the display options in the next cell at the top of all my notebooks.

<br>The four cells after that one set various graphing options.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 200)
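If you want to confirm that an option took effect, or undo it later, here is a minimal sketch using <i>get_option</i> and <i>reset_option</i>. It uses the full option names ('display.max_columns', 'display.max_colwidth'); the variable names are just for illustration.

```python
import pandas as pd

# Apply the two settings, using the full 'display.' option names.
pd.set_option('display.max_columns', None)   # None means "no column limit"
pd.set_option('display.max_colwidth', 200)   # show up to 200 characters per cell

# get_option reads a setting back.
max_columns = pd.get_option('display.max_columns')
max_colwidth = pd.get_option('display.max_colwidth')
print(max_columns, max_colwidth)

# reset_option restores the pandas default for a single option.
pd.reset_option('display.max_colwidth')
default_colwidth = pd.get_option('display.max_colwidth')
print(default_colwidth)
```

Resetting is handy if a very wide text column starts making your output hard to read.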

In [None]:
#NECESSARY FOR XTICKS OPTION, ETC.
from pylab import *

In [None]:
%matplotlib inline  

In [None]:
import seaborn as sns
print(sns.__version__)

In [None]:
plt.rcParams['figure.figsize'] = (10, 7.5)

<br>To make sure division always returns a float (in Python 2, dividing two integers with <i>/</i> otherwise drops the fractional part)

In [None]:
from __future__ import division
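To see the difference this makes, here is a small sketch with made-up integers. It assumes Python 3, or Python 2 with the import above already run; without either, the first result would be truncated like the second.

```python
import numpy as np

# Made-up integer amounts, purely for illustration.
raised = np.array([7, 10])
goal = np.array([2, 4])

# True division keeps the fractional part.
true_ratio = raised / goal
print(true_ratio)

# Floor division (//) drops it, whichever Python version you run.
floor_ratio = raised // goal
print(floor_ratio)
```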

<br>I like suppressing scientific notation in my numbers. So, if you'd rather see "0.48" than "4.800000e-01", then run the following line. Note that this does not change the actual values. For outputting to CSV we'll have to run some additional code later on.

In [None]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
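Here is a quick sketch of what that option does, using two made-up numbers. I'm using <i>option_context</i> so the setting only applies inside the <i>with</i> block; that is a choice for this example, not something the rest of the notebook relies on.

```python
import pandas as pd

# Two made-up numbers with very different magnitudes.
values = pd.Series([0.48, 123456789.123])

# Apply the two-decimal formatter only inside this block.
with pd.option_context('display.float_format', lambda x: '%.2f' % x):
    formatted = values.to_string()
print(formatted)

# The stored values are untouched; only the display changed.
first_value = values.iloc[0]
print(first_value == 0.48)
```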

### Set your working directory 
Change to the folder where you downloaded the Indiegogo file.

In [None]:
cd "/Users/bobsmith/downloads"

### Read in Data

PANDAS can read in data from a variety of file formats. We will load a dataset from the crowdfunding site Indiegogo containing a random sample of 50 of the 3,177 total observations. In the following cell we'll first import the Excel file and assign it to the name 'df' -- short for 'dataframe', the PANDAS name for a dataset. Second, we'll use the <i>len</i> function to count the columns and the rows (observations) in the dataset; there are 50 observations in total. Finally, we will show the first row of the data.

In [None]:
df = pd.read_excel('indiegogo_50_random.xls')
print('# of columns: {}'.format(len(df.columns)))
print('# of observations: {}'.format(len(df)))
df.head(1)
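If you don't have the file handy, the same inspection steps can be tried on a tiny stand-in dataframe. The column names below appear later in this notebook, but the three rows are made up and the name <i>toy</i> is just for illustration.

```python
import pandas as pd

# A tiny made-up stand-in for the Indiegogo data.
toy = pd.DataFrame({
    'category': ['Film', 'Music', 'Film'],
    'funding_goal': [5000, 1000, 20000],
    'amount_raised': [1250, 1000, 400],
})

# shape gives (rows, columns) in one call.
n_rows, n_columns = toy.shape

print('# of columns: {}'.format(len(toy.columns)))
print('# of observations: {}'.format(len(toy)))

# head(1) returns a dataframe holding only the first row.
first_row = toy.head(1)
print(first_row)
```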

### Read in Excel File from Internet

In [None]:
df = pd.read_excel('http://social-metrics.org/wp-content/uploads/2016/06/indiegogo_50_random.xls')
print('# of columns: {}'.format(len(df.columns)))
print('# of observations: {}'.format(len(df)))
df.head(1)

### Inspect the Data

Descriptive or summary statistics are available through the *describe* command.

We can select a single variable:

In [None]:
df['amount_raised'].describe()

<br>Or we can run descriptive statistics on all quantitative variables:

In [None]:
df.describe().T
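To make the shape of that output concrete, here is a sketch on a small made-up dataframe (the values and the name <i>toy</i> are hypothetical). It shows that the text column is left out by default and that <i>.T</i> simply flips the table so each variable gets its own row.

```python
import pandas as pd

# Made-up data: one text column and two numeric columns.
toy = pd.DataFrame({
    'category': ['Film', 'Music', 'Film', 'Art'],
    'funding_goal': [1000, 2000, 3000, 4000],
    'amount_raised': [100, 200, 300, 400],
})

# Statistics are rows and variables are columns.
summary = toy.describe()

# Transposed: variables are rows, statistics are columns.
summary_t = summary.T
print(summary_t)

# Individual statistics can be looked up by variable and statistic name.
median_raised = summary_t.loc['amount_raised', '50%']
print(median_raised)
```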

### Variable Frequencies

To see the frequencies for the different values of a variable, use the *value_counts()* command.

In [None]:
df['category'].value_counts()
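As a small illustration on made-up categories, <i>value_counts()</i> sorts from most to least frequent, and passing <i>normalize=True</i> returns proportions instead of counts.

```python
import pandas as pd

# Made-up category labels.
categories = pd.Series(['Film', 'Music', 'Film', 'Art', 'Film', 'Music'])

# Counts per value, most frequent first.
counts = categories.value_counts()
print(counts)

# Proportions of the total instead of raw counts.
shares = categories.value_counts(normalize=True)
print(shares)
```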

### Plot Some Data

#### Boxplot -- Useful for Seeing a Visual Depiction of Descriptive (Summary) Statistics

In [None]:
df.boxplot('amount_raised', return_type='axes')

#### Bar Plot -- Useful for Seeing Individual Values

In [None]:
df.sort_values(by=['amount_raised'], ascending=False)['amount_raised'].plot(kind='bar')

#### Histogram -- Useful for Plotting Distribution 

In [None]:
df['amount_raised'].hist(bins=20)
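If you'd like to see the numbers behind a histogram rather than the picture, one option is NumPy's <i>np.histogram</i>. This is a stand-in I'm using for illustration, not what the cell above calls directly; as far as I know the plotted bars are built on the same kind of equal-width binning, so the counts should match. The amounts are made up.

```python
import numpy as np
import pandas as pd

# Made-up amounts with one large outlier.
amounts = pd.Series([10, 20, 30, 40, 1000])

# 20 equal-width bins between the minimum and the maximum.
counts, edges = np.histogram(amounts, bins=20)
print(counts)
print(edges)
```

With a single large value, almost everything lands in the first bin, which is the kind of skew you often see in fundraising amounts.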

### Generate New Variables
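Let's create a new variable to see what proportion of the fundraising goal was actually earned. Note that, despite the name <i>percent_earned</i>, the result is a proportion: a value of 1.0 means the goal was exactly met.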

In [None]:
df['percent_earned'] = df['amount_raised']/df['funding_goal']

In [None]:
df['percent_earned'].describe()
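Here is a sketch of the same calculation on three made-up campaigns, so you can check the arithmetic by hand. The columns <i>percent_earned_x100</i> and <i>met_goal</i> are hypothetical extras for illustration and are not used elsewhere in this notebook.

```python
import pandas as pd

# Three made-up campaigns.
toy = pd.DataFrame({
    'funding_goal': [5000, 1000, 20000],
    'amount_raised': [1250, 1000, 400],
})

# Proportion of the goal that was raised (1.0 means the goal was met).
toy['percent_earned'] = toy['amount_raised'] / toy['funding_goal']

# Hypothetical extras: the same figure on a 0-100 scale, and a met-goal flag.
toy['percent_earned_x100'] = toy['percent_earned'] * 100
toy['met_goal'] = toy['percent_earned'] >= 1

print(toy)
```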

In [None]:
df.boxplot('percent_earned', return_type='axes')

<br>Note the strange distribution

In [None]:
df['percent_earned'].plot.hist(bins=100)