
## 1. Importing the Data:
The first step in data cleaning and preprocessing is importing the data. The data may come in various formats, such as CSV, Excel, or JSON files, or tables in a SQL database. Once you have imported the data, explore it to understand its structure and content: the column names, data types, and typical values. You can use Python libraries like pandas and numpy to work with the data.

You can import the Titanic dataset into Python using the pandas library with the following code:

In [None]:
import pandas as pd
titanic = pd.read_csv('https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')

In [None]:
titanic.head()

,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05
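Beyond `head()`, a few one-line checks help with the exploration step described above. The sketch below rebuilds the five rows shown in the output by hand (the name `sample` is only for illustration) so that it runs without a network connection; with the real data you would call the same methods on `titanic`.

```python
import pandas as pd

# The five rows displayed by titanic.head(), typed in by hand so the
# example is self-contained. The second name is truncated in the output
# above and is kept exactly as displayed.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Name": [
        "Mr. Owen Harris Braund",
        "Mrs. John Bradley (Florence Briggs Thayer) Cum...",
        "Miss. Laina Heikkinen",
        "Mrs. Jacques Heath (Lily May Peel) Futrelle",
        "Mr. William Henry Allen",
    ],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Siblings/Spouses Aboard": [1, 1, 0, 1, 0],
    "Parents/Children Aboard": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

n_rows, n_cols = sample.shape   # number of rows and columns
dtypes = sample.dtypes          # data type of each column
summary = sample.describe()     # summary statistics for the numeric columns
missing = sample.isna().sum()   # count of missing values per column

print(n_rows, n_cols)
print(dtypes)
print(summary)
print(missing)
```

Checking `isna().sum()` early is useful because it shows which columns actually need the missing-data handling described in the next section.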


## 2. Handling Missing Data:

Missing data is a common issue in datasets. There are various ways to handle missing data, depending on the nature and extent of the missingness. Some common techniques include:

* Dropping rows or columns with missing values
* Imputing missing values with the mean or median of the column (numeric data)
* Imputing missing values with the mode, that is, the most frequent value (often used for categorical data)
* Imputing missing values with the value from the previous or next row (for time series data)

In many versions of the Titanic dataset, the **Age** column contains missing values; you can confirm this for your copy with `titanic['Age'].isna().sum()`. You can use the **fillna()** method in pandas to fill in the missing values with the median age with the following code:

In [None]:
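# Assign the result back: fillna(inplace=True) on a selected column may not modify `titanic` in newer pandas versions.
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())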
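The other techniques in the list above can be sketched in the same way. The example below uses a small made-up DataFrame (the values and the name `df` are hypothetical, not taken from the Titanic file) to show counting missing values, dropping incomplete rows, imputing with the median or mode, and forward filling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps, for illustration only.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Sex": ["male", "female", None, "female", "female"],
})

# Count the missing values in each column before deciding what to do.
missing_counts = df.isna().sum()

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: impute. Median for the numeric column, mode for the categorical one.
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Sex"] = filled["Sex"].fillna(filled["Sex"].mode()[0])

# Option 3: forward fill, which carries the previous row's value forward.
# This only makes sense when row order is meaningful, as in time series.
ffilled = df["Age"].ffill()

print(missing_counts)
print(filled)
```

Dropping rows is the simplest option but discards data; imputation keeps every row at the cost of introducing estimated values, so the choice depends on how much is missing and why.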

## 3. Handling Outliers:
Outliers are extreme values that deviate from the normal range of values in a dataset. Outliers can distort the results of data analysis, and should be identified and treated accordingly. Some common techniques to handle outliers include:
* Visual inspection of the data to identify outliers
* Using statistical methods like z-score or IQR (Interquartile Range) to identify outliers
* Replacing outliers with the mean, median, or mode of the column
* Removing outliers from the dataset

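Suppose your dataset contains outliers in the **`Fare`** column. You can use the **`clip()`** method in pandas to cap the values at a specified range (the bounds 0 and 200 below are illustrative; values outside the range are replaced by the nearest bound) with the following code: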

In [None]:
titanic['Fare'] = titanic['Fare'].clip(lower=0, upper=200)
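Rather than picking clipping bounds by hand, the IQR rule mentioned in the list above can derive them from the data. The following is a minimal sketch on a made-up series of fares (the values are illustrative, and the 1.5 multiplier is the conventional choice, not something this notebook prescribes).

```python
import pandas as pd

# Hypothetical fares, chosen so that one value is clearly extreme.
fares = pd.Series(
    [7.25, 8.05, 7.925, 13.0, 26.0, 53.1, 71.2833, 512.3292], name="Fare"
)

# IQR rule: values more than 1.5 * IQR beyond the quartiles are flagged.
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask selects the flagged values.
iqr_outliers = fares[(fares < lower) | (fares > upper)]

# Cap the flagged values at the computed bounds instead of removing them.
clipped = fares.clip(lower=lower, upper=upper)

print(iqr_outliers)
print(clipped.max())
```

Whether to clip, replace, or remove flagged values depends on whether they are data errors or genuine extreme observations.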

## 4. Data Transformation:
Data transformation involves converting the data into a suitable format for analysis. Some common techniques of data transformation include:
* Scaling numeric features to a common range so that features with large values do not dominate the analysis
* Applying log or exponential transformations to the data to reduce skewness
* Converting categorical data into numerical data using encoding techniques like one-hot encoding or label encoding
* Binning or grouping data to simplify analysis

Suppose you want to create a new variable called **`FamilySize`** which is the sum of the **`Siblings/Spouses Aboard`** and **`Parents/Children Aboard`** columns (named `SibSp` and `Parch` in some versions of the dataset) plus one (for the passenger). You can use the following code:

In [None]:
titanic['FamilySize'] = titanic['Siblings/Spouses Aboard'] + titanic['Parents/Children Aboard'] + 1
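The other transformations listed above can be sketched on a few rows as well. The example below is an illustration under stated assumptions: the values are the ones shown in the `head()` output earlier, min-max scaling stands in for "scaling" in general, and the age bins and their labels are hypothetical choices.

```python
import numpy as np
import pandas as pd

# A few columns from the rows shown by titanic.head(), typed in by hand.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Min-max scaling: map Fare onto the range [0, 1].
fare_range = df["Fare"].max() - df["Fare"].min()
df["Fare_scaled"] = (df["Fare"] - df["Fare"].min()) / fare_range

# Log transform to reduce right skew; log1p computes log(1 + x), so a fare of 0 is safe.
df["Fare_log"] = np.log1p(df["Fare"])

# One-hot encoding: replaces Sex with Sex_female and Sex_male indicator columns.
df = pd.get_dummies(df, columns=["Sex"], dtype=int)

# Binning: group ages into labelled intervals (0, 18], (18, 30], (30, 60].
df["AgeGroup"] = pd.cut(
    df["Age"], bins=[0, 18, 30, 60], labels=["child", "young adult", "adult"]
)

print(df)
```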

## 5. Handling Duplicate Data:
Duplicate data can be a problem in datasets, as it can distort the results of analysis. Some common techniques to handle duplicate data include:
* Identifying and removing exact duplicates
* Identifying and removing near duplicates using fuzzy matching techniques
* Merging duplicate records to create a single record

Suppose your dataset contains duplicate records due to multiple entries for the same passenger. First check the column names to decide which columns identify a passenger, then drop the rows that repeat the same combination of those columns:

In [None]:
titanic.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'Siblings/Spouses Aboard',
       'Parents/Children Aboard', 'Fare', 'FamilySize'],
      dtype='object')

In [None]:
titanic.drop_duplicates(subset=['Name', 'Sex', 'Fare'], inplace=True)
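Before dropping anything, it is usually worth counting the duplicates and looking at them. The sketch below uses three names from the `head()` output plus one deliberately repeated row (the repetition is made up for illustration) and the same key columns as the cell above.

```python
import pandas as pd

# Three passengers from the head() output; the third row repeats the first
# on purpose to simulate a duplicate entry.
df = pd.DataFrame({
    "Name": [
        "Mr. Owen Harris Braund",
        "Miss. Laina Heikkinen",
        "Mr. Owen Harris Braund",
        "Mr. William Henry Allen",
    ],
    "Sex": ["male", "female", "male", "male"],
    "Fare": [7.25, 7.925, 7.25, 8.05],
})

key = ["Name", "Sex", "Fare"]

# duplicated() marks every repeat after the first occurrence as True.
dup_mask = df.duplicated(subset=key, keep="first")
n_dups = int(dup_mask.sum())

# Inspect the rows that would be removed.
print(df[dup_mask])

# Drop the repeats and renumber the index.
deduped = df.drop_duplicates(subset=key).reset_index(drop=True)
print(n_dups, len(deduped))
```

Choosing the `subset` matters: two different passengers can share a name, so including more identifying columns lowers the risk of removing a legitimate record.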

## 6. Data Integration:
Data integration involves combining data from multiple sources into a single dataset. Some common techniques of data integration include:
* Joining tables in SQL databases based on common keys
* Merging data frames in pandas based on common columns
* Concatenating data from multiple files or folders

Suppose you have another dataset of Titanic passengers with the same columns and you want to combine the two datasets. You can stack them row-wise using the **`pd.concat()`** function in pandas with the following code:

In [None]:
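# 'new_titanic.csv' is a placeholder file name; read_csv raises FileNotFoundError if it does not exist.
new_titanic = pd.read_csv('new_titanic.csv')
# Stack the rows of both frames and renumber the index so labels are not repeated.
combined_data = pd.concat([titanic, new_titanic], axis=0, ignore_index=True)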

FileNotFoundError: [Errno 2] No such file or directory: 'new_titanic.csv'
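Because `new_titanic.csv` is not available here, the two pandas techniques from the list above can be tried on in-memory data instead. In this sketch the frames `batch_a`, `batch_b`, and `cabins`, and the columns `PassengerId` and `Deck`, are all hypothetical; only the names and fares come from the `head()` output.

```python
import pandas as pd

# Two hypothetical batches of passenger records with identical columns.
batch_a = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["Mr. Owen Harris Braund", "Miss. Laina Heikkinen"],
    "Fare": [7.25, 7.925],
})
batch_b = pd.DataFrame({
    "PassengerId": [3],
    "Name": ["Mr. William Henry Allen"],
    "Fare": [8.05],
})

# Concatenation: stack rows; ignore_index=True renumbers the index 0..n-1.
combined = pd.concat([batch_a, batch_b], axis=0, ignore_index=True)

# A hypothetical lookup table that shares the PassengerId column.
cabins = pd.DataFrame({"PassengerId": [1, 3], "Deck": ["C", "E"]})

# Merge: a left join keeps every passenger and leaves Deck missing
# where the lookup table has no matching PassengerId.
merged = combined.merge(cabins, on="PassengerId", how="left")

print(combined)
print(merged)
```

Concatenation fits when the sources hold different rows of the same table; merging fits when they hold different columns about the same entities.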

These are some of the common techniques used in data cleaning and preprocessing. The specific techniques used will depend on the nature of the data and the goals of the analysis. Remember that data cleaning and preprocessing is an iterative process, and may require multiple rounds of cleaning and refinement. With a clean and well-prepared dataset, you can carry out accurate and reliable data analysis to make informed business decisions.

In [None]:
combined_data.head()