#### This tutorial shows different ways to import data for use with Python

In [1]:
# numpy and pandas are both common python modules for data science
import numpy as np
import pandas as pd

##### Download your data set

###### **First, find where you downloaded your data, then define the file path or open a file handle**

In [4]:
# if you are a Windows user, make sure to use a "raw" string path
file_path = r'C:\Users\aregel\Desktop\FoCo_DS\train.csv'
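A small sketch of why the raw string matters: in a normal string, backslashes start escape sequences, so each one has to be doubled (and `'C:\Users'` written without the `r` prefix is a syntax error in Python 3, because `\U` starts a unicode escape). The `PureWindowsPath` part is an addition for illustration, not something the tutorial itself uses; it comes from the standard `pathlib` module and only inspects the text of the path, so it runs on any operating system.

```python
from pathlib import PureWindowsPath

# The raw string keeps every backslash exactly as typed
raw_path = r'C:\Users\aregel\Desktop\FoCo_DS\train.csv'

# Without the r prefix, each backslash has to be escaped by doubling it
escaped_path = 'C:\\Users\\aregel\\Desktop\\FoCo_DS\\train.csv'
same_string = raw_path == escaped_path

# PureWindowsPath parses the path text without touching the disk
parsed = PureWindowsPath(raw_path)
file_name = parsed.name      # last component of the path
extension = parsed.suffix    # file extension, including the dot

print(same_string, file_name, extension)
```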

In [10]:
# to create a file object, use the built-in open() function
file_handle = open(file_path, mode='r')

**For this tutorial, passing file_path is equivalent to passing a file object opened in 'r' (read) mode. However, a file object has a lot of features that a plain path doesn't. If you are interested in the open() function, this website http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python goes into a lot of detail, including examples.**
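To illustrate a few of the file object features mentioned above, here is a minimal sketch. It writes a small throwaway CSV first so it can run anywhere; `demo_path` and its contents are made up for the example and are not the Titanic file. It also uses a `with` block, which is a common way to make sure the file is closed when you are done with it.

```python
import os
import tempfile

# Create a tiny throwaway CSV (hypothetical data, only for this example)
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.csv")
with open(demo_path, mode="w", newline="") as f:
    f.write("PassengerId,Survived\n1,0\n2,1\n")

# 'r' opens the file for reading; the with block closes it automatically
with open(demo_path, mode="r") as demo_handle:
    header = demo_handle.readline().strip()   # read just the first line
    remaining_rows = demo_handle.readlines()  # read everything that is left

handle_is_closed = demo_handle.closed
print(header, len(remaining_rows), handle_is_closed)
```

Note that a file object remembers its position: once the rows have been read, reading again returns nothing unless you reopen the file or call `seek(0)`.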

###### **Next, import the data as a pandas dataframe. I like to go straight to a dataframe, because pandas has a lot of built-in methods to handle different file extensions, like .csv and .xlsx**

In [11]:
# opening a csv file
data_1 = pd.read_csv(file_path)
# OR
data_2 = pd.read_csv(file_handle)


In [12]:
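# opening an excel file (file_path must point to an Excel file such as .xlsx, not the .csv above)
data_3 = pd.read_excel(file_path)
# OR, with a file handle opened in binary mode, e.g. open(file_path, mode='rb')
data_4 = pd.read_excel(file_handle)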

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


**The Titanic data is very clean, so all of these import options will give you the same result. However, both the read_csv and read_excel functions have a lot of parameters to help with cleaning up data.
read_excel doc: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
read_csv doc: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html**
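As a rough illustration of those cleaning parameters, the sketch below reads a deliberately messy CSV. The data is invented for the example (it is not the Titanic file) and is read from an in-memory string so the code runs without any download. `skiprows`, `na_values`, `usecols` and `index_col` are all standard `read_csv` parameters.

```python
import io
import pandas as pd

# Hypothetical messy data: a note on the first line, "?" used for missing values
raw_text = """# demo export
PassengerId,Survived,Age,Fare
1,0,22,7.25
2,1,?,71.2833
3,1,26,?
"""

messy = pd.read_csv(
    io.StringIO(raw_text),
    skiprows=1,                                  # skip the note on the first line
    na_values=["?"],                             # treat "?" as a missing value
    usecols=["PassengerId", "Survived", "Age"],  # load only these columns
    index_col="PassengerId",                     # use PassengerId as the row index
)

print(messy)
```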

#### You now have your data in a pandas dataframe
#### *Below are some simple and helpful built-in methods for looking at the data*

In [13]:
# shows you the first 5 rows
data_1.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [14]:
# returns summary statistics for the numeric columns of the dataframe
data_1.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [15]:
# returns all of the column names for the dataframe
list(data_1)

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']
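A few more ways to look at a dataframe, sketched on a small stand-in frame (three rows with values copied from the output above, so the example runs without the CSV file). The name `sample` is made up for the example; `shape`, `dtypes` and `isna()` are standard pandas features.

```python
import numpy as np
import pandas as pd

# A small stand-in frame built from the first rows shown above
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
})

n_rows, n_cols = sample.shape             # (number of rows, number of columns)
column_types = sample.dtypes              # data type of each column
missing_per_column = sample.isna().sum()  # count of missing values per column

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
```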

#### You can go from a pandas dataframe to a numpy array

In [16]:
# transforms the whole dataframe to a numpy array
# (as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
data_1.to_numpy()

array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ..., 
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)

In [17]:
# transforms only selected columns to a numpy array
data_1[['PassengerId', 'Survived', 'Pclass']].to_numpy()

array([[  1,   0,   3],
       [  2,   1,   1],
       [  3,   1,   3],
       ..., 
       [889,   0,   3],
       [890,   1,   1],
       [891,   0,   3]], dtype=int64)
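The two outputs above differ in `dtype`: a numpy array holds a single data type, so mixing numbers and text falls back to `object`, while integer-only columns keep an integer type. Here is a minimal sketch of that behaviour on a made-up three-row frame, using `to_numpy()` (available from pandas 0.24 onward).

```python
import pandas as pd

# Hypothetical stand-in frame with two integer columns and one text column
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Embarked": ["S", "C", "S"],
})

# Numbers and text together: numpy falls back to the generic object dtype
mixed = sample.to_numpy()

# Integer columns only: the array keeps an integer dtype
numeric = sample[["PassengerId", "Survived"]].to_numpy()

print(mixed.dtype, numeric.dtype)
```

Arrays with a numeric dtype are usually what you want for calculations, so it can help to select the numeric columns before converting.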

#### You don't have to use pandas or numpy, but this is a beginner tutorial, and pandas dataframes and numpy arrays have a lot of great methods for analyzing data. The dataframe is also a very common data type to use with scientific and statistical libraries such as SciPy and statsmodels