## Handling Irrelevant Data


Irrelevant data are fields or records that are not needed for the problem we are trying to solve. Examples:

  - Column-wise: if we are analyzing the general health of a population, a phone number column is not needed.
  - Row-wise: if we are interested in only one country, the rows for all other countries can be excluded.

Deciding what to drop:

  - If we are sure that a piece of data is unimportant, we may drop it. Otherwise, explore the correlation matrix between the feature variables first.
  - If a feature shows no correlation with the target, ask a domain expert before dropping it. A feature that seems irrelevant statistically can still be very relevant from a domain perspective, such as a clinical one.
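
The column-wise and row-wise cases, and the correlation check, can be sketched as follows. This is a minimal illustration on an invented health survey table: the frame `df`, its column names, and every value in it are assumptions made up for the example, not part of any real dataset.

```python
import pandas as pd

# Hypothetical health survey data; all values are invented for illustration.
df = pd.DataFrame({
    "country": ["France", "France", "Spain", "France", "Spain"],
    "age": [34, 51, 29, 62, 45],
    "bmi": [22.1, 27.4, 24.0, 29.8, 26.3],
    "phone": ["555-0101", "555-0102", "555-0103", "555-0104", "555-0105"],
})

# Column-wise: drop a field that cannot inform a health analysis.
df_relevant = df.drop(columns=["phone"])

# Row-wise: keep only the country we are interested in.
df_france = df_relevant[df_relevant["country"] == "France"]

# When unsure about a numeric feature, inspect pairwise correlations first.
corr_matrix = df_relevant.select_dtypes("number").corr()
print(corr_matrix)
```

Both operations return new frames, so the original `df` stays available in case a dropped field turns out to matter later.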

Does a lack of correlation mean that a feature is irrelevant? Not necessarily. The absence of a statistically significant correlation between a feature and a label (target variable) does not automatically imply that the feature is irrelevant. There are several points to consider when assessing the relevance of a feature in a data analysis or machine learning context:

 - Statistical Correlation: Lack of a statistically significant correlation does not necessarily mean the feature is irrelevant. Correlation measures linear relationships, and there might be complex, nonlinear relationships that are not captured by correlation coefficients.

 - Domain Knowledge: It's crucial to consider domain knowledge and expertise when determining feature relevance. Some features may have theoretical importance or real-world significance even if they don't show a strong statistical correlation with the target variable.

 - Feature Importance: Machine learning models, especially tree-based models like decision trees and random forests, can provide insights into feature importance. Even if a feature has a weak linear correlation, it might still be important in the context of a specific model.

 - Interaction Effects: A feature may matter only in combination with another feature, even if it shows no relationship with the target variable on its own. In such cases, it's important to consider feature interactions when assessing relevance.

 - Dimensionality Reduction: In high-dimensional datasets, irrelevant features can introduce noise and increase model complexity. Feature selection and other dimensionality reduction techniques can help identify and remove them.

 - Data Quality: A feature may appear irrelevant only because of noise or errors in the data. It's essential to ensure data quality and correctness before drawing conclusions about feature relevance.

 - Feature Engineering: Sometimes, irrelevant raw features can be transformed or combined to create new, more informative features.

 - Business Goals: The relevance of a feature can also depend on the specific goals of the analysis or machine learning task. A feature might be irrelevant for one objective but crucial for another.
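
Two of the points above, nonlinear relationships and interaction effects, can be checked numerically. The sketch below uses small synthetic arrays (all names and values are made up for illustration) in which the target is fully determined by the features, yet the Pearson correlation is zero.

```python
import numpy as np

# Nonlinear relationship: y depends entirely on x, but not linearly.
x = np.linspace(-1.0, 1.0, 201)  # symmetric around zero
y = x ** 2

r_raw = np.corrcoef(x, y)[0, 1]             # close to 0
r_transformed = np.corrcoef(x ** 2, y)[0, 1]  # close to 1
print(f"corr(x, y)   = {r_raw:.3f}")
print(f"corr(x^2, y) = {r_transformed:.3f}")

# Interaction effect: the target is the XOR of two binary features.
a = np.array([0, 0, 1, 1])
b = np.array([0, 1, 0, 1])
target = a ^ b  # [0, 1, 1, 0]

r_a = np.corrcoef(a, target)[0, 1]  # 0: a alone says nothing
r_b = np.corrcoef(b, target)[0, 1]  # 0: b alone says nothing
print(f"corr(a, target) = {r_a:.3f}")
print(f"corr(b, target) = {r_b:.3f}")
```

In both cases, dropping a feature because of its near-zero correlation would discard information the target depends on.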

In summary, while a lack of correlation between a feature and a label might raise questions about its relevance, it should not be the sole criterion for determining feature relevance. A comprehensive analysis that considers statistical measures, domain knowledge, and the specific context of the problem is essential to make informed decisions about feature relevance in data analysis and machine learning.

In [8]:
import pandas as pd

In [9]:
# Import the dataset
df_titanic = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Titanic/titanic_train.csv")

In [10]:
df_titanic.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [11]:
df_titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

The "Name" and "Ticket" columns look unlikely to have much impact on the analysis, so we can drop them as irrelevant.

In [12]:
# Drop the "Name" and "Ticket" columns in place
df_titanic.drop(columns=["Name", "Ticket"], inplace=True)

In [13]:
df_titanic.head()

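|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | NaN | S |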
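
As the Feature Engineering point earlier suggests, a raw column that looks irrelevant can sometimes be turned into something informative before it is dropped. The sketch below is only an illustration, not part of the original workflow: it retypes the five rows from the first `df_titanic.head()` output and derives a hypothetical `Title` column from "Name". The regular expression and the names `sample`, `Title`, and `survival_by_title` are assumptions made for this example.

```python
import pandas as pd

# The five rows from the first df_titanic.head() output, retyped by hand so
# the sketch runs without downloading the dataset. The second name is
# truncated exactly as it appears in that output.
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Survived": [0, 1, 1, 1, 0],
})

# Assumption: the title is the text between the comma and the next period.
sample["Title"] = sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Mean survival per title; five rows are far too few to conclude anything,
# so this only shows the mechanics of the check.
survival_by_title = sample.groupby("Title")["Survived"].mean()
print(survival_by_title)
```

Running the same check on the full dataset, and discussing the result with someone who knows the domain, would give a firmer basis for deciding whether "Name" can really be dropped.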


* Finally, I encourage you to talk with domain experts to strengthen your domain knowledge, so that you can make good decisions about which data to keep and which to drop.
