<h1><span style="color: lightblue;">Performing Data Wrangling ⚡️</span></h1>

## 🌟 Step 1
<h2><span style="color: yellow;">Data Gathering 🌎 </span></h2> 

   Data can be gathered from various sources. This notebook uses:<br>
- `CSV files` <br>

In [1]:
import pandas as pd
import numpy as np

In [3]:
titanic_data = pd.read_csv('Titanic-Dataset.csv')
titanic_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## 🌟 Step 2
<h2><span style="color: yellow;">Data Assessment 🌎 </span></h2>  
- In this step you build a deeper understanding of the data. Before choosing cleaning methods, you need a clear idea of what the data contains.<br>
- In essence, this step produces a summary of the whole dataset.<br>
- Data assessment is often an iterative process.<br>
    
<h2><span style="color: red;"> 🛑 Step 1 Discover </span></h2>  
- View datasets.<br>
- Check shape of the data.<br>
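
As a quick sketch of the discovery step, the snippet below builds a small stand-in frame from the first three rows shown earlier and inspects its shape and column types. The name `sample_df` is illustrative and not part of the notebook; on the real data the same calls would be made on `titanic_data`.

```python
import pandas as pd

# Stand-in frame built from the first three rows of the dataset (illustrative only)
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows x {n_cols} columns")

# dtypes lists the type pandas inferred for each column
print(sample_df.dtypes)
```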
    
### 🔥 Automatic Assessments 
- Programmatic.<br>
- Using Pandas.<br>
  - `head` and `tail`
  - `describe`
  - `sample`
  - `info`
  - `isnull`
  - `duplicated`
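
The cells in this notebook show `sample`, `info`, and `duplicated`. As a minimal sketch of the remaining methods in the list, the example below runs `head`, `tail`, `describe`, and `isnull` on a small stand-in frame. The frame `demo_df` is an assumption for illustration, with values taken from rows printed in the notebook output.

```python
import numpy as np
import pandas as pd

# Stand-in frame (illustrative); values come from rows shown in the notebook output
demo_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 889],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan],
})

first_rows = demo_df.head(2)                 # first two rows
last_rows = demo_df.tail(2)                  # last two rows
age_summary = demo_df["Age"].describe()      # count, mean, std, min, quartiles, max
missing_per_column = demo_df.isnull().sum()  # number of missing values per column

print(first_rows)
print(last_rows)
print(age_summary)
print(missing_per_column)
```

Note that `describe` skips missing values, so its `count` can be lower than the number of rows.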

In [34]:
titanic_data.sample(50)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
344,345,0,2,"Fox, Mr. Stanley Hubert",male,36.0,0,0,229236,13.0,,S
516,517,1,2,"Lemore, Mrs. (Amelia Milley)",female,34.0,0,0,C.A. 34260,10.5,F33,S
211,212,1,2,"Cameron, Miss. Clear Annie",female,35.0,0,0,F.C.C. 13528,21.0,,S
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5,,S
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
403,404,0,3,"Hakkarainen, Mr. Pekka Pietari",male,28.0,1,0,STON/O2. 3101279,15.85,,S
766,767,0,1,"Brewe, Dr. Arthur Jackson",male,,0,0,112379,39.6,,C
430,431,1,1,"Bjornstrom-Steffansson, Mr. Mauritz Hakan",male,28.0,0,0,110564,26.55,C52,S
115,116,0,3,"Pekoniemi, Mr. Edvard",male,21.0,0,0,STON/O 2. 3101294,7.925,,S
53,54,1,2,"Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkin...",female,29.0,1,0,2926,26.0,,S


In [14]:
titanic_data[titanic_data['Survived'] == 1].count()

PassengerId    342
Survived       342
Pclass         342
Name           342
Sex            342
Age            290
SibSp          342
Parch          342
Ticket         342
Fare           342
Cabin          136
Embarked       340
dtype: int64

In [17]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [27]:
num_duplicates = titanic_data.duplicated().sum()
num_duplicates

0

In [33]:
titanic_data['Age'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 891 entries, 0 to 890
Series name: Age
Non-Null Count  Dtype  
--------------  -----  
714 non-null    float64
dtypes: float64(1)
memory usage: 7.1 KB


### 🔥 Manual Assessments 
- Export the data to Excel and inspect it by eye, or look through it manually in Google Sheets.

In [6]:
with pd.ExcelWriter('titanic_dataset.xlsx') as writer:
  titanic_data.to_excel(writer,sheet_name='titanic_data')

<h2><span style="color: red;"> 🛑 Step 2 Document </span></h2> 
- Summarize the dataset.<br>
- List the issues found in the dataset and collect them in one document.<br>

### 1. 🟨 Write a summary for your data
- **The Titanic Tragedy Dataset**
- The RMS Titanic set sail on its maiden voyage in 1912, only to become infamous for its tragic sinking. This dataset provides insights into the passengers aboard the ill-fated ship.
- **Dataset Overview**: Contains information about 891 passengers, including their survival status.
- **Survival Statistics**: Of the 891 passengers in this dataset, 549 did not survive and 342 survived the disaster.
- **Passenger Information**: The dataset includes various details about the passengers, such as their age, class, and ticket information, providing a glimpse into the demographics and circumstances of those on board.
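
The survival figures in the summary can be derived with `value_counts`. The sketch below uses a five-value stand-in for the `Survived` column (the first five rows of the dataset), so its numbers are illustrative only. On the full frame the same calls should reproduce the 549 / 342 split quoted above.

```python
import pandas as pd

# Stand-in for the Survived column: first five rows of the dataset (illustrative)
survived = pd.Series([0, 1, 1, 1, 0], name="Survived")

counts = survived.value_counts()  # number of passengers per outcome
survival_rate = survived.mean()   # share of 1s, because the column is coded 0/1

print(counts)
print(f"Survival rate: {survival_rate:.1%}")
```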

### 2. 🟨 Column Description
- **<span style="color: red;">💥 Table</span>** - `titanic_dataset`<br>
  - **PassengerId**: Unique identifier for each passenger.
  - **Survived**: Survival status (0 = No, 1 = Yes).
  - **Pclass**: Passenger class (1, 2, or 3), indicating the level of luxury and accommodation on the ship.
  - **Name**: Name of the passenger (last name, then title and first names, sometimes with a nickname in quotes or another name in parentheses).
  - **Sex**: Gender of the passenger (male or female).
  - **Age**: Age of the passenger at the time of the voyage.
  - **SibSp**: Number of siblings or spouses aboard the Titanic.
  - **Parch**: Number of parents or children aboard the Titanic.
  - **Ticket**: Ticket number assigned to the passenger.
  - **Fare**: Amount paid for the ticket.
  - **Cabin**: Cabin number where the passenger stayed (if available).
  - **Embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

### 3. 🟨 Add any additional information
- **Survival Rate**: The `Survived` column makes it possible to compare survival rates across passenger classes and demographics.
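
A comparison of survival rates across groups could be sketched with `groupby`. The frame `subset_df` below is an assumption for illustration: six rows copied from the notebook output, far too few to be representative of the full dataset.

```python
import pandas as pd

# Six rows taken from the notebook output (illustrative subset, not representative)
subset_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
})

# The mean of a 0/1 column within each group is that group's survival rate
rate_by_class = subset_df.groupby("Pclass")["Survived"].mean()
rate_by_sex = subset_df.groupby("Sex")["Survived"].mean()

print(rate_by_class)
print(rate_by_sex)
```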

### 4. 🟨 Issues with the Dataset 🛑
#### ⭐️ Step 1: Dirty Data
- **Duplicated Data**: 
  - Duplicate records can skew analysis and results. The `duplicated()` check above found none in this dataset.
- **Missing Data**: 
  - Some columns have missing values (e.g., `Age`, `Cabin`, `Embarked`), which need to be handled to ensure completeness.
- **Corrupt Data**: 
  - Entries like `NaN` in `Age`, `Cabin`, or `Embarked` might indicate data corruption or incomplete data collection.
- **Inaccurate Data**: 
  - Potential inaccuracies in fields like `Fare`, `Age`, or `Cabin` could impact the analysis.
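
To put numbers on the dirty-data issues above, one option is a small report of missing values per column plus a duplicate count. `missing_report` is a hypothetical helper written for this sketch, and `demo_df` is a stand-in frame, not the real dataset.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column (hypothetical helper)."""
    missing = df.isnull().sum()
    report = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(1),
    })
    # Columns with the most missing values come first
    return report.sort_values("missing", ascending=False)

# Stand-in frame (illustrative)
demo_df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", "S"],
})

report = missing_report(demo_df)
n_duplicates = int(demo_df.duplicated().sum())  # fully identical rows

print(report)
print("Duplicate rows:", n_duplicates)
```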

#### ⭐️ Step 2: Messy Data
- **Structural Issues**: 
  - Ensure that each variable (column) is clearly defined and correctly placed in its respective column.
  - Each observation (row) should be distinct and correctly formatted.
  - The dataset should be organized into a single table where each column represents a specific variable, each row represents an observation, and the table contains the entire dataset.
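
One structural check that can be automated is whether a column mixes value types, for example numbers stored alongside text. The helper `mixed_type_columns` below is hypothetical and the frame is deliberately broken for illustration. It is a sketch of one possible check, not a complete tidy-data audit.

```python
import pandas as pd

def mixed_type_columns(df: pd.DataFrame) -> list:
    """List columns whose non-null values have more than one Python type (hypothetical helper)."""
    mixed = []
    for col in df.columns:
        n_types = df[col].dropna().map(type).nunique()
        if n_types > 1:
            mixed.append(col)
    return mixed

# Stand-in frame (illustrative); one Fare value is stored as text on purpose
demo_df = pd.DataFrame({
    "Fare": [7.25, "71.2833", 7.925],
    "Pclass": [3, 1, 3],
})

print(mixed_type_columns(demo_df))
```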

### 5. 🟨 Providing Solutions to Issues within the Dataset 🛑
- **<span style="color: red;">💥 Table</span>** - `titanic_dataset`<br>
1. **Duplicated Data**:
   - **Solution**: Identify and remove duplicate rows. Use data processing tools to check for duplicates and eliminate them. In Python with Pandas, you can use:
     ```python
     titanic_dataset = titanic_dataset.drop_duplicates()
     ```

2. **Missing Data**:
   - **Solution**:
     - **Imputation**: Fill missing values with a representative value. For a numerical column like `Age`, you might use the median or mean. For a categorical column like `Embarked`, you could use the mode.
       ```python
       titanic_dataset['Age'] = titanic_dataset['Age'].fillna(titanic_dataset['Age'].median())
       titanic_dataset['Embarked'] = titanic_dataset['Embarked'].fillna(titanic_dataset['Embarked'].mode()[0])
       ```
     - **Deletion**: Remove rows or columns with excessive missing data if imputation is not feasible or if it would introduce significant bias.

3. **Corrupt Data**:
   - **Solution**: Identify and address anomalies or incorrect values.
     - **Check and Correct Errors**: For columns like `Fare`, ensure there are no negative values or extreme outliers unless they are valid. 
       ```python
       titanic_dataset = titanic_dataset[titanic_dataset['Fare'] >= 0]
       ```
     - **Standardization**: Ensure consistent formatting (e.g., currency symbols removed from `Fare` values).

4. **Inaccurate Data**:
   - **Solution**: Validate data accuracy by cross-checking with reliable sources or domain knowledge.
     - **Consistency Checks**: Verify that values make sense in the context (e.g., `Age` should be within a reasonable range).
     - **Manual Review**: For critical columns, conduct a manual review or perform data validation checks to catch inaccuracies.

5. **Structural Issues**:
   - **Solution**:
     - **Ensure Proper Column and Row Structure**: Verify that each column represents a unique variable and each row represents a single observation.
     - **Normalization**: If any columns contain nested data or multiple values, normalize them into separate columns or tables as needed.
     - **Consistency**: Check for consistency in data types across columns and ensure no mixed types within a single column.
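
The deletion and consistency-check ideas from the list above could be sketched as follows. The 50% missing-value cut-off, the 0 to 100 age range, and the frame `demo_df` (including its implausible age of 150) are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in frame (illustrative); the age of 150 is deliberately implausible
demo_df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 150.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# Deletion: drop columns where more than half of the values are missing
# (the 50% threshold is an assumption, not a rule)
max_missing_share = 0.5
missing_share = demo_df.isnull().mean()
trimmed_df = demo_df.loc[:, missing_share <= max_missing_share]

# Consistency check: flag known ages outside an assumed plausible range of 0 to 100
age_known = trimmed_df["Age"].notna()
age_in_range = trimmed_df["Age"].between(0, 100)
implausible_age = trimmed_df[age_known & ~age_in_range]

print(list(trimmed_df.columns))
print(implausible_age)
```

Flagged rows are only candidates for review. Whether to correct or drop them depends on domain knowledge.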

## 🌟 Step 3
<h2><span style="color: yellow;">Data Cleaning or Data Quality Dimensions✨</span></h2> 

- Work through the steps below in order.<br>
- Each step consists of three parts:<br>
   `Define`, `Code`, `Test` <br>
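
A minimal sketch of one `Define`, `Code`, `Test` cycle, assuming the issue being fixed is missing `Age` values. The frame `demo_df` is a stand-in for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in frame (illustrative)
demo_df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0]})

# Define: missing Age values should be replaced with the column median.

# Code: compute the median and assign the filled column back.
median_age = demo_df["Age"].median()
demo_df["Age"] = demo_df["Age"].fillna(median_age)

# Test: confirm the fix worked before moving on to the next issue.
assert demo_df["Age"].isnull().sum() == 0
assert demo_df.loc[1, "Age"] == median_age
```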

In [74]:
titanic_data_df = titanic_data.copy()
titanic_data_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


`Always make sure to create a copy of your pandas dataframe before you start the cleaning process`
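
The small sketch below shows why the copy matters: changes made to the copy leave the original frame untouched, so the raw data stays available for comparison. The names `raw_df` and `clean_df` are illustrative.

```python
import pandas as pd

# Stand-in frame (illustrative)
raw_df = pd.DataFrame({"Fare": [7.25, 0.0, 8.05]})

# copy() makes an independent (deep) copy by default
clean_df = raw_df.copy()

# Modify only the copy: replace zero fares with the median fare
clean_df.loc[clean_df["Fare"] == 0, "Fare"] = clean_df["Fare"].median()

print(raw_df["Fare"].tolist())    # original values are unchanged
print(clean_df["Fare"].tolist())  # cleaned values
```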

In [75]:
# Fill missing Age values with the median
# (assigning back avoids the chained-assignment pitfalls of inplace=True)
titanic_data_df['Age'] = titanic_data_df['Age'].fillna(titanic_data_df['Age'].median())

# Fill missing Embarked values with the most common value
titanic_data_df['Embarked'] = titanic_data_df['Embarked'].fillna(titanic_data_df['Embarked'].mode()[0])

# Mark missing Cabin values as 'Unknown'
titanic_data_df['Cabin'] = titanic_data_df['Cabin'].fillna('Unknown')

In [77]:
titanic_data_df['Name'].sample(50)

130                                 Drazenoic, Mr. Jozef
92                           Chaffee, Mr. Herbert Fuller
167      Skoog, Mrs. William (Anna Bernhardina Karlsson)
701                     Silverthorne, Mr. Spencer Victor
620                                  Yasbeck, Mr. Antoni
584                                  Paulner, Mr. Uscher
123                                  Webber, Miss. Susan
283                           Dorking, Mr. Edward Arthur
690                              Dick, Mr. Albert Adrian
415              Meek, Mrs. Thomas (Annie Louise Rowley)
560                             Morrow, Mr. Thomas Rowan
737                               Lesurer, Mr. Gustave J
875                     Najib, Miss. Adele Kiamie "Jane"
233                       Asplund, Miss. Lillian Gertrud
173                            Sivola, Mr. Antti Wilhelm
540                              Crosby, Miss. Harriet R
713                           Larsson, Mr. August Viktor
779    Robert, Mrs. Edward Scot

In [81]:
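# Drop the raw Name column; run this only after the name-splitting cells in Step 4 (In [78] to In [80])
titanic_data_df.drop(columns='Name', inplace=True)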

In [83]:
titanic_data_df.drop_duplicates(inplace=True)

In [84]:
titanic_data_df['Age'] = titanic_data_df['Age'].astype(int)

In [87]:
titanic_data_df.drop(['PassengerId', 'Ticket', 'Cabin'], axis=1, inplace=True)

## 🌟 Step 4
<h2><span style="color: yellow;">Feature Engineering ✨</span></h2> 

In [78]:
import re

def separate_name(full_name):
    # Split "Surname, Title Given-names (Other name)" into three parts;
    # the parenthesised part is optional
    pattern = r'^([^,]+), ([^(]+)(?:\(([^)]+)\))?$'
    match = re.match(pattern, full_name)

    if match:
        surname = match.group(1).strip()
        title_and_given_names = match.group(2).strip()
        parenthesised_name = match.group(3).strip() if match.group(3) else None
        return (surname, title_and_given_names, parenthesised_name)
    else:
        # Names that do not fit the pattern yield three empty parts
        return (None, None, None)

In [79]:
# First part: the text before the comma
titanic_data_df['Name_1'] = titanic_data_df['Name'].apply(lambda x: separate_name(x)[0])

In [80]:
# Second part: title and given names; third part: the name in parentheses, if any
titanic_data_df['Sure Name'] = titanic_data_df['Name'].apply(lambda x: separate_name(x)[1])
titanic_data_df['Nick Name'] = titanic_data_df['Name'].apply(lambda x: separate_name(x)[2])
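
As an alternative to calling `separate_name` once per new column, the same pattern can be applied in a single pass with `Series.str.extract`. This is a sketch of a different technique from the one used in the cells above. The group names (`surname`, `title_and_given`, `parenthesised`) are illustrative, and the three names are taken from the dataset rows shown earlier.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    'Johnston, Miss. Catherine Helen "Carrie"',
])

# Same regular expression as separate_name, with named groups (names are illustrative)
pattern = (
    r'^(?P<surname>[^,]+), '
    r'(?P<title_and_given>[^(]+)'
    r'(?:\((?P<parenthesised>[^)]+)\))?$'
)

# One column per named group; an absent optional group becomes a missing value
parts = names.str.extract(pattern)

# Trim the whitespace left before an opening parenthesis
parts = parts.apply(lambda col: col.str.strip())

print(parts)
```

Note that a nickname in double quotes stays inside the second part, because the pattern only separates text in parentheses.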

In [85]:
titanic_data_df[titanic_data_df['Fare'] == 0]

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name_1,Sure Name,Nick Name
179,180,0,3,male,36,0,0,LINE,0.0,Unknown,S,Leonard,Mr. Lionel,
263,264,0,1,male,40,0,0,112059,0.0,B94,S,Harrison,Mr. William,
271,272,1,3,male,25,0,0,LINE,0.0,Unknown,S,Tornquist,Mr. William Henry,
277,278,0,2,male,28,0,0,239853,0.0,Unknown,S,Parkes,"Mr. Francis ""Frank""",
302,303,0,3,male,19,0,0,LINE,0.0,Unknown,S,Johnson,Mr. William Cahoone Jr,
413,414,0,2,male,28,0,0,239853,0.0,Unknown,S,Cunningham,Mr. Alfred Fleming,
466,467,0,2,male,28,0,0,239853,0.0,Unknown,S,Campbell,Mr. William,
481,482,0,2,male,28,0,0,239854,0.0,Unknown,S,Frost,"Mr. Anthony Wood ""Archie""",
597,598,0,3,male,49,0,0,LINE,0.0,Unknown,S,Johnson,Mr. Alfred,
633,634,0,1,male,28,0,0,112052,0.0,Unknown,S,Parr,Mr. William Henry Marsh,


In [86]:
# Treat a fare of 0 as a placeholder and replace it with the median fare
median_fare = titanic_data_df['Fare'].median()
titanic_data_df.loc[titanic_data_df['Fare'] == 0, 'Fare'] = median_fare

- Create new features:
- `FamilySize`: Add `SibSp` and `Parch`, plus 1 for the passenger, to create a new feature `FamilySize`.

In [88]:
titanic_data_df['FamilySize'] = titanic_data_df['SibSp'] + titanic_data_df['Parch'] + 1

- `IsAlone`: Create a feature to indicate whether a passenger is travelling alone (`FamilySize` of 1).

In [95]:
titanic_data_df['IsAlone'] = 1
titanic_data_df.loc[titanic_data_df['FamilySize'] > 1, 'IsAlone'] = 0

In [97]:
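# Group ages into labelled bands; include_lowest=True keeps an Age of 0 in the first band
bins = [0, 12, 20, 40, 60, 80]
labels = ['Child', 'Teen', 'Adult', 'Middle-Aged', 'Senior']
titanic_data_df['AgeGroup'] = pd.cut(titanic_data_df['Age'], bins, labels=labels, include_lowest=True)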
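
`pd.cut` uses right-closed intervals by default, so the first band is (0, 12] and a value equal to the lowest edge falls outside every band. That matters here because an infant's fractional age becomes 0 after `astype(int)`. The sketch below compares the default behaviour with `include_lowest=True` on a few boundary values. The series `ages` is a stand-in chosen for illustration.

```python
import pandas as pd

# Boundary values chosen for illustration
ages = pd.Series([0, 12, 13, 80])
bins = [0, 12, 20, 40, 60, 80]
labels = ['Child', 'Teen', 'Adult', 'Middle-Aged', 'Senior']

# Default: intervals are (0, 12], (12, 20], ... so 0 is not assigned a label
default_groups = pd.cut(ages, bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_groups = pd.cut(ages, bins, labels=labels, include_lowest=True)

print(default_groups.tolist())
print(inclusive_groups.tolist())
```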