# Getting started with Python for data science challenges.

## Why Python?
* Rich libraries for data analysis (other languages have good libraries too).
* Python is a good fit when your data analysis tasks need to be integrated with web apps.
* A growing user base.

Python has a few disadvantages compared with contenders like R. To quote one, R's visualization libraries
are better, although Python visualization libraries like Seaborn are closing this gap.

So it's up to you which one you go for.

## Getting started with the tutorial.
* We will be using Kaggle's <a href="https://www.kaggle.com/c/titanic">Titanic challenge</a> as a reference for the tutorial.
* We will go over
    * How to read data.
    * Basics of NumPy and Pandas.
    * How to use the basic models defined in the scikit-learn library.


## Reading Data using Pandas.


In [33]:
# import the required libraries
import pandas as pd # from here on, the pandas library is referred to as 'pd'
import numpy as np

df = pd.read_csv('train.csv', header=0)
df
# Below is the data frame (only the first rows are shown here). Pandas loads the data into a DataFrame,
# which is similar to a table in an RDBMS.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
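
If `train.csv` is not at hand, `read_csv` can also be tried on in-memory text. The sketch below assumes a few made-up rows with a subset of the Titanic column names; the values are illustrative and not copied from the real file.

```python
import io

import pandas as pd

# Made-up sample rows (hypothetical values) using a subset of the Titanic columns.
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""

# read_csv accepts any file-like object; header=0 takes column names from the first line.
sample = pd.read_csv(io.StringIO(csv_text), header=0)

print(sample.shape)
# The empty Age field in the last row is read as NaN.
print(sample["Age"].isnull().sum())
```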


In [18]:
# head(5) returns the first 5 rows (similarly, tail(5) returns the last 5).
df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [19]:
# to know the data types of all the columns in the data frame we use
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [20]:
# to know info about the number of rows and null values we use
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
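
The `info()` output above shows that Age, Cabin and Embarked have fewer than 891 non-null entries. A more direct way to count missing values per column might look like the sketch below, which uses a small made-up frame (the `toy` name and its values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

On the tutorial's data frame the same call would be `df.isnull().sum()`.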


In [21]:
# describe() gives basic statistics for the data frame (by default, only for numerical columns).
df.describe()
# the output includes stats such as the count, mean, min and max of each column.

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


## Referencing and indexing in Pandas

In [22]:
# show the first 10 values of the Age column. syntax: df['column_name'][row_slice] or df.column_name[row_slice]
df['Age'][0:10] # returns a Series (a one-dimensional labelled array: one column together with its index)

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64
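
Besides the bracket syntax above, pandas offers the explicit indexers `.iloc` (by position) and `.loc` (by label). The two behave differently when the index labels are not 0, 1, 2, ..., which the following sketch tries to show on a made-up frame (the `toy` frame and its labels are assumptions).

```python
import pandas as pd

toy = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0, 35.0], "Pclass": [3, 1, 3, 1]},
    index=[10, 11, 12, 13],  # non-default labels, to make the difference visible
)

# .iloc selects by position: the first two rows, end exclusive.
by_position = toy.iloc[0:2]["Age"]

# .loc selects by label: rows labelled 10 through 12, end inclusive.
by_label = toy.loc[10:12, "Age"]

print(by_position.tolist())
print(by_label.tolist())
```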

In [23]:
# using the above to compute the mean of the Age column
df['Age'].mean()
# mean() ignores NaN values by default.

29.69911764705882

In [24]:
# to select multiple columns at a time we provide the list of column names instead of one value
df[['Sex','Pclass','Age']][0:10]
#df[['Sex','Pclass','Age']] would return all the 891 rows.

Unnamed: 0,Sex,Pclass,Age
0,male,3,22.0
1,female,1,38.0
2,female,3,26.0
3,female,1,35.0
4,male,3,35.0
5,male,3,
6,male,1,54.0
7,male,3,2.0
8,female,3,27.0
9,female,2,14.0


In [25]:
# one of the most used operations is filtering.
# it is similar to SQL's WHERE clause.
# to see the details of passengers whose age is greater than 70
df[df['Age']>70] #df[condition]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S


In [26]:
# projecting certain columns of the selection.
df[df['Age'] > 60][['Sex', 'Pclass', 'Age', 'Survived']]
# in the rows shown below, every man over 60 died; the only survivor is a first-class woman.

Unnamed: 0,Sex,Pclass,Age,Survived
33,male,2,66.0,0
54,male,1,65.0,0
96,male,1,71.0,0
116,male,3,70.5,0
170,male,1,61.0,0
252,male,1,62.0,0
275,female,1,63.0,1
280,male,3,65.0,0
326,male,3,61.0,0
438,male,1,64.0,0


In [27]:
# Null values must be handled before most algorithms can be applied.
# to select the rows with null values we use the same "where"-style filtering as above.
df[df['Age'].isnull()][['Sex', 'Pclass', 'Age']][0:10]
# negate the condition to get only the rows with non-null values.

Unnamed: 0,Sex,Pclass,Age
5,male,3,
17,male,2,
19,female,3,
26,male,3,
28,female,3,
29,male,3,
31,female,1,
32,female,3,
36,male,3,
42,male,3,
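
The negation mentioned in the last comment can be written with the `~` operator or with `notnull()`. A minimal sketch on a made-up frame (names and values are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Age": [22.0, np.nan, 35.0],
})

# ~ inverts a boolean mask, so this keeps the rows where Age is present.
has_age = toy[~toy["Age"].isnull()]

# notnull() expresses the same condition without the negation.
same_rows = toy[toy["Age"].notnull()]

print(has_age)
```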


In [28]:
# Applying multiple conditions at a time. (Below computes the number of male passengers in each class.)
for i in range(1,4):
    print(i, len(df[ (df['Sex'] == 'male') & (df['Pclass'] == i) ]))

1 122
2 108
3 347
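
The loop above counts male passengers one class at a time. As an alternative, `groupby` can produce all the counts in one step. The sketch below assumes a small made-up frame rather than the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "male", "female"],
    "Pclass": [1, 3, 3, 3, 2],
})

# Filter to male passengers first, then count the rows in each class.
males = toy[toy["Sex"] == "male"]
male_counts = males.groupby("Pclass").size()

print(male_counts)
```

Classes with no matching rows (class 2 in this toy frame) simply do not appear in the result.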


In [29]:
# visualizing data using Pandas's histogram
import matplotlib.pyplot as plt # pyplot is used here in place of the older pylab interface
# plot a histogram of the Age column, dropping NaN values first.
# basic command: df['column_name'].hist()
df['Age'].dropna().hist(bins=16, range=(0,80), alpha = .5)
plt.show()

## Formatting and Cleaning data

In [34]:
# Not all features are in the required format. Ex: Sex is stored as a string (female, male).
# We encode it as integers (female -> 0, male -> 1)
# in a new column 'Gender'.
#df['Gender'] = df['Sex'].map( lambda x: x[0].upper() )
df['Gender'] = df['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
# to check the modification we use the describe method:
# the count for 'Gender' should be 891, i.e. every row was mapped.
df.describe()
#print(df.head())

In [31]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Gender
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1


In [35]:
# filling null values with appropriate values

# we fill the NaN values of Age with the median age for the passenger's gender and class.

# first we use a numpy array to hold the median age of each (gender, class) combination.

median_ages = np.zeros((2,3))
for i in range(0, 2):
    for j in range(0, 3):
        median_ages[i,j] = df[(df['Gender'] == i) &  (df['Pclass'] == j+1)]['Age'].dropna().median()

#median_ages
# to avoid losing the original data we work on a copy of the Age column
df['AgeFill'] = df['Age']
df[ df['Age'].isnull() ][['Gender','Pclass','Age','AgeFill']].head(10)

Unnamed: 0,Gender,Pclass,Age,AgeFill
5,1,3,,
17,1,2,,
19,0,3,,
26,1,3,,
28,0,3,,
29,1,3,,
31,0,1,,
32,0,3,,
36,1,3,,
42,1,3,,


In [36]:
# we now fill each NaN value of the 'AgeFill' column with the median of its (gender, class) group.
for i in range(0, 2):
    for j in range(0, 3):
        df.loc[ (df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j+1),
                'AgeFill'] = median_ages[i,j]
df[ df['Age'].isnull() ][['Gender','Pclass','Age','AgeFill']].head(10)

Unnamed: 0,Gender,Pclass,Age,AgeFill
5,1,3,,25.0
17,1,2,,30.0
19,0,3,,21.5
26,1,3,,25.0
28,0,3,,21.5
29,1,3,,25.0
31,0,1,,35.0
32,0,3,,21.5
36,1,3,,25.0
42,1,3,,25.0
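
The two nested loops above can also be expressed with `groupby` and `transform`, which computes the median of each (Gender, Pclass) group and lines it up with the original rows. This is an alternative sketch, not the approach used in the cells above, and it runs on a made-up frame.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Gender": [1, 1, 1, 0, 0, 0],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# For every row, the median Age of that row's (Gender, Pclass) group; NaN is skipped.
group_median = toy.groupby(["Gender", "Pclass"])["Age"].transform("median")

# Keep known ages and fill only the missing ones with the group median.
toy["AgeFill"] = toy["Age"].fillna(group_median)

print(toy)
```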


## Feature Engineering.

In [37]:
# we can try the impact of different combinations of features
df['Age*Class'] = df.AgeFill * df.Pclass
import matplotlib.pyplot as plt
df['Age*Class'].hist()
plt.show()

In [38]:
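# Dropping columns
df = df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1) 
# Dropping rows with NaN values.
# Note: 'Age' still contains NaN, so this also removes the rows whose 'AgeFill' was filled in (714 rows remain).
df = df.dropna()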

In [39]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Gender,AgeFill,Age*Class
count,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0,714.0
mean,448.582633,0.406162,2.236695,29.699118,0.512605,0.431373,34.694514,0.634454,29.699118,61.938151
std,259.119524,0.49146,0.83825,14.526497,0.929783,0.853289,52.91893,0.481921,14.526497,34.379609
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0,0.0,0.42,0.92
25%,222.25,0.0,1.0,20.125,0.0,0.0,8.05,0.0,20.125,38.0
50%,445.0,0.0,2.0,28.0,0.0,0.0,15.7417,1.0,28.0,58.0
75%,677.75,1.0,3.0,38.0,1.0,1.0,33.375,1.0,38.0,81.0
max,891.0,1.0,3.0,80.0,5.0,6.0,512.3292,1.0,80.0,222.0


In [40]:
# to look at the columns of the data frame.
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare',
       'Gender', 'AgeFill', 'Age*Class'],
      dtype='object')

## Fitting a linear model.

In [41]:
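# We first split the available training data into training and validation sets.
# df.ix has been removed from pandas; .loc slices by label and includes both endpoints,
# so label 500 lands in both sets. Use df.loc[:499] for a strictly disjoint training set.
train_data=df.loc[0:500]
validate_data=df.loc[500:]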

In [42]:
# importing the linear_model module from scikit-learn.
from sklearn import linear_model

# create a linear regression model
lr=linear_model.LinearRegression()
# features: every column from position 2 onwards; target: 'Survived' at position 1
trainx=train_data.iloc[:,2:]
trainy=train_data.iloc[:,1]

# fit the linear model.
lr.fit(trainx,trainy)
# access the coefficients using coef_
print("w values for the linear model",lr.coef_)

testx=validate_data.iloc[:,2:]
testy=validate_data.iloc[:,1]
ypredicted=lr.predict(testx)
# wrapping the predictions in a DataFrame gives them a fresh 0-based index
predicted_values=pd.DataFrame(ypredicted)

# threshold the continuous prediction at 0.5 to get a 0/1 label
def fun1(x):
    if x>0.5:
        return 1
    else:
        return 0

predicted_labels = predicted_values[0].apply(fun1)

print(predicted_labels.head())
print(testy.head())

w values for the linear model [ -2.05532290e-01  -4.53611395e-03  -5.18681533e-02   1.30721132e-02
  -2.30157382e-04  -5.20276835e-01  -4.53611395e-03   1.32442425e-03]
0    0
1    1
2    1
3    1
4    0
Name: 0, dtype: int64
500    0
501    0
503    0
504    1
505    0
Name: Survived, dtype: int64
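
A natural next step is to measure how many predicted labels match the true ones. In the output above the predicted labels are indexed from 0 while `testy` keeps its original labels (500, 501, 503, ...), so the two are best compared by position. The sketch below uses hypothetical predictions and labels, not results from the model above.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions and true labels, made up for illustration.
predicted = pd.Series([0, 1, 1, 0])                           # default index 0..3
actual = pd.Series([0, 1, 0, 0], index=[500, 501, 503, 504])  # original row labels

# Converting to NumPy arrays makes the comparison positional rather than label-based.
matches = predicted.to_numpy() == actual.to_numpy()

# Accuracy is the fraction of positions where prediction and truth agree.
accuracy = matches.mean()
print(accuracy)
```

With the tutorial's variables this would correspond to comparing `predicted_labels` with `testy` in the same way.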
