## Handling Irrelevant Data

YouTube link: https://youtu.be/TBgRjL8tsYk

Irrelevant data is data that is not needed and doesn’t fit the context of the problem we’re trying to solve. Examples:

  - If we are analyzing data about the general health of the population, the phone number wouldn’t be necessary (column-wise).
  - If we are interested in only one particular country, we wouldn’t need to include all other countries (row-wise).

If we are sure that a piece of data is unimportant, we may drop it. Otherwise, we should explore the correlation matrix between the feature variables. Even if a feature shows no correlation, we should ask a domain expert before dropping it: a feature that seems irrelevant in the data could be very relevant from a domain perspective, such as a clinical perspective.
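
The column-wise and row-wise cases above, along with the correlation check, can be sketched as follows. This is a minimal illustration on a made-up toy table: the column names (`country`, `phone`, `age`, `bmi`) and all values are hypothetical and are not part of the Titanic dataset used below.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "country": ["IN", "US", "IN", "UK"],
    "phone": ["555-0101", "555-0102", "555-0103", "555-0104"],
    "age": [34, 51, 29, 42],
    "bmi": [22.5, 27.1, 24.8, 30.2],
})

# Column-wise: drop a column that has no bearing on the analysis.
df_cols = df.drop(columns=["phone"])

# Row-wise: keep only the rows for the one country of interest.
df_rows = df_cols[df_cols["country"] == "IN"]

# When unsure about a feature, inspect the pairwise correlations
# between the numeric columns before deciding to drop anything.
corr = df_cols.select_dtypes(include="number").corr()
print(corr)
```

A low correlation here is only a hint, not a verdict, which is why the domain-expert check still applies.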

In [1]:
import pandas as pd

In [2]:
# Import the dataset
df_titanic = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Titanic/titanic_train.csv")

In [3]:
df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The "Name" and "Ticket" columns do not appear to have much impact on the analysis, so we can drop them as irrelevant.

In [4]:
df_titanic.drop(columns=["Name", "Ticket"], inplace=True)

In [5]:
df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,1,0,3,male,22.0,1,0,7.25,,S
1,2,1,1,female,38.0,1,0,71.2833,C85,C
2,3,1,3,female,26.0,0,0,7.925,,S
3,4,1,1,female,35.0,1,0,53.1,C123,S
4,5,0,3,male,35.0,0,0,8.05,,S
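
One way to sanity-check a drop like this is to look at how many distinct values each column holds, since identifier-like columns tend to be almost entirely unique. The sketch below rebuilds a few columns of the five rows shown by `df_titanic.head()` by hand so that it runs without downloading the CSV; the names `sample`, `uniqueness`, and `trimmed` are introduced here for illustration only. With just five rows the ratios are not conclusive, so treat this as a pattern to apply to the full DataFrame.

```python
import pandas as pd

# A few columns from the five rows shown by df_titanic.head(), typed in by hand.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "female", "female", "male"],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
})

# Share of distinct values per column; values near 1.0 suggest an
# identifier-like column that carries little general signal as-is.
uniqueness = sample.nunique() / len(sample)
print(uniqueness)

# Non-mutating alternative to inplace=True: returns a new DataFrame
# and leaves the original untouched.
trimmed = sample.drop(columns=["Name", "Ticket"])
print(trimmed.columns.tolist())
```

Returning a new DataFrame instead of using `inplace=True` keeps the original available in case a dropped column turns out to matter later.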
