# Titanic Dataset

## About the columns
1. pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower
2. age: Age in years. Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
3. sibsp: Number of siblings / spouses aboard the Titanic. The dataset defines these family relations as follows:
-  Sibling = brother, sister, stepbrother, stepsister
-  Spouse = husband, wife (mistresses and fiancés were ignored)
4. parch: Number of parents / children aboard the Titanic. The dataset defines these family relations as follows:
-  Parent = mother, father
-  Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch=0 for them.
5. sex: Sex (male/female)
6. ticket: Ticket number
7. fare: Passenger fare
8. cabin: Cabin number
9. embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
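10. name: Passenger name, including an honorific such as 'Mr', 'Mrs' or 'Miss'
11. passengerid: Passenger identifier
12. survived: Survival (0 = No, 1 = Yes)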
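
The coded columns above can be turned into readable labels with a dictionary lookup. The following is a minimal sketch on a hypothetical three-row sample rather than the real data; the output column names `ses` and `port` are assumptions made for illustration.

```python
import pandas as pd

# Label dictionaries taken from the column descriptions above
PCLASS_LABELS = {1: 'Upper', 2: 'Middle', 3: 'Lower'}
PORT_LABELS = {'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'}

# Hypothetical sample standing in for the real dataset
sample = pd.DataFrame({'Pclass': [1, 3, 2], 'Embarked': ['S', 'C', 'Q']})

# Series.map looks each code up in the dictionary; unknown codes become NaN
sample['ses'] = sample['Pclass'].map(PCLASS_LABELS)
sample['port'] = sample['Embarked'].map(PORT_LABELS)
print(sample)
```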

In [168]:
import pandas as pd

df = pd.read_csv('train.csv')
df.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


#### Preprocessing 'Name'
We preprocess this column because each name contains an honorific ('Mr', 'Mrs', 'Miss', etc.) that carries useful information.

In [169]:
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [170]:
newdata = pd.DataFrame()
newdata['gender']= df['Sex']
newdata['honorifics'] = df['Name'].str.extract(r',\s(\w+)\.')
abnormal_honorifics_index = df.index[newdata['honorifics'].isna()]
newdata['honorifics'].unique()


array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', nan, 'Jonkheer'],
      dtype=object)

In [171]:
df.loc[abnormal_honorifics_index,'Name']

759    Rothes, the Countess. of (Lucy Noel Martha Dye...
Name: Name, dtype: object

The pattern fails here because the title is preceded by "the", so we fill in the honorific by hand for this abnormal case.

In [172]:
newdata.at[759,'honorifics']='Countess'
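
As an alternative to patching the row by hand, a slightly more permissive pattern could capture the title directly. The sketch below is an assumption, not the notebook's method: the pattern is chosen for illustration and has only been checked against the sample names shown here, which are copied from the output above (including the truncated one).

```python
import pandas as pd

# Sample names copied from the notebook output above
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Montvila, Rev. Juozas',
    'Rothes, the Countess. of (Lucy Noel Martha Dye...',
])

# Allow an optional leading "the " before the title (illustrative pattern)
pattern = r',\s*(?:the\s+)?([A-Za-z]+)\.'

# expand=False returns a Series because there is a single capture group
titles = names.str.extract(pattern, expand=False)
print(titles.tolist())
```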

Note: the above categories can be reduced to the following 5:
1. Mr: Don, Sir, Jonkheer
- based on gender: Rev, Dr
2. Mrs: Mme, Ms, Lady, Mlle, Countess
- based on gender: Rev, Dr
3. Miss
4. Master
5. Military_Officer: Major, Col, Capt

In [173]:
redundant_honorifics = ['Rev', 'Dr', 'Mr', 'Don', 'Sir', 'Jonkheer', 'Mrs', 'Mme', 'Ms', 'Lady', 'Mlle', 'Countess']

for index, value in newdata['honorifics'].items():
    # Collapse civilian titles into 'Mr' or 'Mrs' according to gender
    if value in redundant_honorifics:
        if newdata.at[index, 'gender'] == 'female':
            newdata.at[index, 'honorifics'] = 'Mrs'
        else:
            newdata.at[index, 'honorifics'] = 'Mr'
    # Group military ranks into a single category
    elif value in ('Major', 'Col', 'Capt'):
        newdata.at[index, 'honorifics'] = 'Military_Officer'
    # A missing title is NaN, not the string 'nan', so test it with pd.isna
    elif pd.isna(value):
        print(f"Missing: {index}")


In [174]:
df['Name'] = newdata['honorifics']
df['Name'].unique()

array(['Mr', 'Mrs', 'Miss', 'Master', 'Military_Officer'], dtype=object)
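
The same grouping can also be expressed with boolean masks instead of a row-by-row loop. This is a sketch on a hypothetical seven-row sample, assuming the same title lists as above; `group_titles` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample standing in for the 'honorifics' and 'gender' columns
sample = pd.DataFrame({
    'honorifics': ['Mr', 'Dr', 'Dr', 'Mlle', 'Major', 'Miss', 'Master'],
    'gender': ['male', 'male', 'female', 'female', 'male', 'female', 'male'],
})

civilian = ['Rev', 'Dr', 'Mr', 'Don', 'Sir', 'Jonkheer',
            'Mrs', 'Mme', 'Ms', 'Lady', 'Mlle', 'Countess']
military = ['Major', 'Col', 'Capt']

def group_titles(frame):
    """Return a copy of the titles reduced to the five categories."""
    titles = frame['honorifics'].copy()
    is_civilian = titles.isin(civilian)
    is_female = frame['gender'].eq('female')
    # Civilian titles collapse to 'Mrs' or 'Mr' depending on gender
    titles[is_civilian & is_female] = 'Mrs'
    titles[is_civilian & ~is_female] = 'Mr'
    # Military ranks share one category; 'Miss' and 'Master' are untouched
    titles[titles.isin(military)] = 'Military_Officer'
    return titles

grouped = group_titles(sample)
print(grouped.tolist())
```

Working on a copy keeps the original column available for comparison while the mapping is being checked.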

We have preprocessed the data for 'Name'

In [175]:
df.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,5,2,,,,681.0,,147,3
top,,,,Mr,male,,,,347082.0,,B96 B98,S
freq,,,,532,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


#### Preprocessing 'Age'
We fill each missing age with the average age of passengers who share the same honorific title.

In [176]:
avg_by_title = df.groupby('Name')['Age'].transform('mean')
avg_by_title
# avg_by_title has the same length as the original DataFrame df.
# Each element of this series is the mean age within the 'Name' (honorific) group of the respective row in df.

0      32.697816
1      35.713043
2      21.773973
3      35.713043
4      32.697816
         ...    
886    32.697816
887    21.773973
888    21.773973
889    32.697816
890    32.697816
Name: Age, Length: 891, dtype: float64
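
To see how `transform('mean')` lines up with the original rows, here is a small worked example on a hypothetical five-row frame (the values are made up for illustration). Missing ages are skipped when the group mean is computed, and each row then receives the mean of its own group.

```python
import pandas as pd

# Hypothetical toy data: two title groups, one missing age in each
toy = pd.DataFrame({
    'Name': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss'],
    'Age': [30.0, 40.0, None, 20.0, None],
})

# One value per row: the mean age of that row's title group (NaN ignored)
group_mean = toy.groupby('Name')['Age'].transform('mean')

# Only the missing entries are replaced; known ages stay unchanged
filled = toy['Age'].fillna(group_mean)
print(filled.tolist())
```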

In [177]:
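# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and which may not update df.
df['Age'] = df['Age'].fillna(avg_by_title)
# The null values in the 'Age' column are replaced with the average age of their respective 'Name' (honorific) groups.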
df.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,891.0,891.0,891.0,891.0,891.0,204,889
unique,,,,5,2,,,,681.0,,147,3
top,,,,Mr,male,,,,347082.0,,B96 B98,S
freq,,,,532,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.784724,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,13.278781,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,21.773973,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,30.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,35.713043,1.0,0.0,,31.0,,


The 'Age' column now has no missing values (count is 891).