<!--- Mohammad Idrees Bhat | Mohammad Idrees Bhat --->

<div style="background-color: #add8e6; padding: 10px; height: 70px; border-radius: 15px;">
    <div style="font-family: 'Georgia', serif; font-size: 20px; padding: 10px; text-align: right; position: absolute; right: 20px;">
        Mohammad Idrees Bhat<br>
        <span style="font-family: 'Arial', sans-serif;font-size: 12px; color: #0a0a0a;">Tech Skills Trainer | AI/ML Consultant</span>
    </div>
</div>

<h1 style=" background-color: #002147; color: White; padding: 30px; text-align:center"> Application of NumPy and Pandas </h1>

<div style="background-color: lightgreen; color: black; padding: 10px;">
    <h1> Data Cleaning
</h1> </div>

# Importance of Data Cleaning

Data cleaning is a crucial step in the data analysis process, and here's why:

1. **Improves Data Quality**: Clean data ensures accuracy and reliability. When data is free of errors, duplicates, and inconsistencies, you can trust the insights derived from it.

2. **Enhances Decision-Making**: Businesses rely on data-driven decisions. Clean data provides a solid foundation for making informed choices, leading to better outcomes.

3. **Increases Efficiency**: Time spent on analyzing messy data can be significantly reduced. By cleaning the data beforehand, analysts can focus on deriving insights rather than fixing issues.

4. **Prevents Misleading Conclusions**: Outliers or erroneous entries can skew results, leading to incorrect interpretations. Data cleaning helps mitigate these risks, ensuring that conclusions are based on accurate representations.

5. **Facilitates Collaboration**: Clean data is easier to share and understand. When teams work with standardized and accurate datasets, collaboration becomes smoother, fostering better communication and teamwork.

6. **Regulatory Compliance**: Many industries have regulations regarding data accuracy and integrity. Cleaning data helps organizations comply with these standards, avoiding legal issues and penalties.

In summary, data cleaning is not just a technical necessity; it's an essential practice for achieving reliable, actionable insights that drive success.


### Skills Covered
1. Understanding data cleaning concepts and importance.
2. Using NumPy for basic array operations and handling missing data.
3. Utilizing Pandas for data manipulation, including filtering, grouping, and sorting.
4. Applying data transformation techniques such as renaming columns and handling duplicates.
5. Implementing data visualization to identify outliers and data distribution.

### Learning Outcomes
1. Ability to clean and preprocess data for analysis.
2. Proficiency in using NumPy and Pandas for data manipulation tasks.
3. Understanding of how to handle missing values effectively.
4. Skill in transforming data to make it suitable for analysis.
5. Knowledge of visualizing data to aid in cleaning processes.


<div style="background-color: lightgreen; color: black; padding: 4px;">
    <h4> Simple Data Cleaning with Pandas 
</h4> </div>

In [5]:
import pandas as pd

# load dataset
df = pd.read_csv('iris.csv')  # Replace 'iris.csv' with the path to your own file

In [23]:
# Display the first 5 rows
print("\n head \n")
print(df.head())

# Display summary information
# (info() prints its report directly and returns None, so it is not wrapped in print())
print("\n info \n")
df.info()

# Display a statistical summary of the numerical columns
print("\n describe \n")
print(df.describe())


 head 

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa

 info 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB

In [54]:
df.sample(5)  # Randomly select 5 rows (use frac= to select a fraction of the rows instead)

      Id  SepalLong  SepalWidthCm  PetalLengthCm  PetalWidthCm          Species
141  142        6.9           3.1            5.1           2.3   Iris-virginica
125  126        7.2           3.2            6.0           1.8   Iris-virginica
110  111        6.5           3.2            5.1           2.0   Iris-virginica
51    52        6.4           3.2            4.5           1.5  Iris-versicolor
80    81        5.5           2.4            3.8           1.1  Iris-versicolor
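
Because `sample()` draws rows at random, the rows shown change on every run. The following is a minimal sketch, assuming a small made-up DataFrame (`demo` is a hypothetical name, not part of the iris data), of how `random_state` makes the draw repeatable and how `frac` selects a fraction of the rows:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate sample()
demo = pd.DataFrame({"value": range(10)})

# Fixing random_state returns the same rows on every run
first = demo.sample(n=3, random_state=42)
second = demo.sample(n=3, random_state=42)
print(first.equals(second))

# frac selects a fraction of the rows instead of a fixed count
half = demo.sample(frac=0.5, random_state=42)
print(len(half))
```

A fixed seed is handy when you want classmates or reviewers to see exactly the same sampled rows.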


In [27]:
# Count missing values in each column
missing_values = df.isnull().sum()  
print(missing_values)

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
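
The iris file has no missing values, so every count above is 0. To see what the check looks like when gaps exist, here is a small sketch on made-up data (`toy` and its columns are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberately missing entries
toy = pd.DataFrame({
    "length": [5.1, np.nan, 4.7, 4.6],
    "width": [3.5, 3.0, np.nan, np.nan],
    "species": ["setosa", "setosa", None, "virginica"],
})

missing_counts = toy.isnull().sum()             # missing entries per column
missing_share = toy.isnull().mean() * 100       # same, as a percentage of rows
rows_with_gaps = toy[toy.isnull().any(axis=1)]  # rows with at least one gap

print(missing_counts)
print(missing_share)
print(rows_with_gaps)
```

Looking at the share of missing values per column, not just the count, can help when deciding between dropping and filling.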


<div style="background-color: lightblue; color: Black; padding: 10px; ">
    <h4> Handling Missing Data
</h4> </div>

<div style="background-color: lightblue; color: Black;  ">
    <h6> Remove rows with any missing values
</h6> </div>

In [32]:
df.dropna(inplace=True)  # Remove rows with any missing values

<div style="background-color: lightblue; color: Black;  ">
    <h6> Replace with 0
</h6> </div>

In [34]:
df.fillna(value=0, inplace=True)  # Replace missing values with 0
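
Dropping and filling are alternatives: once `dropna()` has removed every incomplete row, a later `fillna()` has nothing left to fill. The sketch below compares the options on made-up data (`toy` is hypothetical, and mean imputation is shown as one additional possibility, not as the method used above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the iris file above has no missing values
toy = pd.DataFrame({
    "length": [5.0, np.nan, 4.0, 6.0],
    "width": [3.0, 3.5, np.nan, 2.5],
})

dropped = toy.dropna()                # keeps only complete rows
zero_filled = toy.fillna(0)           # replaces every gap with 0
mean_filled = toy.fillna(toy.mean())  # replaces gaps with the column mean

print(len(dropped))
print(zero_filled["length"].mean())
print(mean_filled["length"].mean())
```

Filling with 0 pulls the column mean down, while filling with the mean leaves it unchanged, so the right choice depends on what a missing value means in your data.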

<div style="background-color: lightblue; color: Black;  ">
    <h6> Remove duplicate rows
</h6> </div>

In [None]:
df.drop_duplicates(inplace=True)  # Remove duplicate rows
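
Before removing duplicates it can help to count them first. The following sketch uses made-up rows (`toy` is hypothetical) to show the difference between exact duplicates and rows that only share a key column:

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate and one repeated Id
toy = pd.DataFrame({
    "Id": [1, 2, 2, 3, 3],
    "Species": ["setosa", "setosa", "setosa", "virginica", "versicolor"],
})

n_exact = toy.duplicated().sum()                        # fully identical rows
deduped = toy.drop_duplicates()                         # drops exact duplicates only
by_id = toy.drop_duplicates(subset="Id", keep="first")  # keeps one row per Id

print(n_exact)
print(len(deduped))
print(len(by_id))
```

By default only rows that match in every column are dropped; `subset` narrows the comparison to the columns you name.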

<div style="background-color: lightblue; color: Black; padding: 10px; ">
    <h4> Rename column
</h4> </div>

In [48]:
df.rename(columns={'SepalLengthCm': 'SepalLong'}, inplace=True)  # Rename a column {'OldName': 'NewName'}

In [50]:
df.info()  # Check the DataFrame again after cleaning (info() prints its report directly)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLong      150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


<div style="background-color: lightblue; color: white; padding: 10px; text-align: center;">
    <h1>_________________________________END________________________________
</h1> </div>

<div class="alert alert-block alert-warning">
    <b><font size="5"> Live Exercise</font> </b>
</div>

Now it's your turn!
### Task: Data Cleaning Exercise

### Objective
In this exercise, you will practice data cleaning techniques using NumPy and Pandas. You will work with a dataset of your choice, or one provided to you, and apply various data cleaning methods.

## Instructions
1. **Import the Necessary Libraries**
   - Write the code to import the Pandas and NumPy libraries.

2. **Load the Dataset**
   - Write the code to load the provided dataset into a Pandas DataFrame.

3. **Explore the Data**
   - Use the appropriate methods to view the first few rows of the dataset, obtain summary information, and describe the dataset’s statistical properties.

4. **Identify Missing Values**
   - Write code to check for any missing values in the dataset.

5. **Handle Missing Values**
   - Decide how to handle missing values (drop, fill with mean/median/mode) and write the corresponding code to apply your chosen method.

6. **Detect Outliers**
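   - Use visualizations (like box plots) or statistical methods to identify outliers in the numerical columns.

7. **Handle Outliers**
   - Decide how to handle the outliers and write the code to implement your chosen method.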

8. **Remove Duplicates**
   - Write the code to check for and remove any duplicate rows in the DataFrame.

9. **Standardize Data Formats**
   - Ensure that categorical variables have consistent formats (e.g., lowercase). Write the code to standardize the specified columns.

10. **Sampling Data**
    - Write the code to take a random sample of the cleaned data for further analysis.

11. **Save the Cleaned Data**
    - Write the code to save the cleaned DataFrame to a new CSV file.

## Deliverables
- Submit the Jupyter notebook containing your code and comments explaining each step. Make sure to document your choices and methods clearly.




<div class="alert alert-block alert-warning">
    <b><font size="5"> Advanced Exercise 1 (Optional)</font> </b>
</div>

### Activity Motive: Enhancing the Titanic Dataset for Insightful Analysis

In this activity, you will clean and manipulate the Titanic dataset to prepare it for analysis. This exercise is crucial as it helps you practice essential data cleaning skills and gain insights from historical data, which can inform future decisions in maritime safety.

### Instructions:

1. **Load the Titanic Dataset**: Use Pandas to load the dataset into a DataFrame.
2. **Inspect the Data**: Use methods like `.head()`, `.info()`, and `.describe()` to understand the structure and content of the dataset.
3. **Handle Missing Values**: Identify columns with missing values and decide on an appropriate strategy (e.g., imputation or removal).
4. **Detect and Handle Outliers**: Analyze numerical features for outliers and apply techniques to manage them.
5. **Feature Engineering**: Create new features that may enhance analysis, such as family size from 'SibSp' and 'Parch'.
6. **Convert Categorical Variables**: Convert categorical variables into numerical formats using one-hot encoding or label encoding.
7. **Sample the Data**: Use the `.sample()` method to randomly select a subset of the data for analysis.

### Suggested Dataset:
- [Titanic Dataset from Kaggle](https://www.kaggle.com/c/titanic/data)
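
As a hint for steps 5 and 6, the sketch below builds a family-size feature and one-hot encodes two categorical columns. It runs on a few made-up rows that mimic the Kaggle column names (`SibSp`, `Parch`, `Sex`, `Embarked`); the `FamilySize` name and the choice to count the passenger are assumptions, not requirements of the exercise.

```python
import pandas as pd

# Hypothetical rows that mimic the Kaggle Titanic columns
titanic = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Family size = siblings/spouses + parents/children + the passenger
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"] + 1

# One-hot encode the categorical columns (the originals are replaced)
encoded = pd.get_dummies(titanic, columns=["Sex", "Embarked"])

print(titanic["FamilySize"].tolist())
print(sorted(encoded.columns))
```

One-hot encoding creates one indicator column per category, which avoids implying an order between categories the way label encoding can.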



<div class="alert alert-block alert-warning">
    <b><font size="5"> Advanced Exercise 2 (Optional)</font> </b>
</div>

### Next Steps after Data Cleaning and Manipulation:

1. **Exploratory Data Analysis (EDA)**:
   - Visualize the cleaned data using libraries like Matplotlib or Seaborn.
   - Analyze relationships between features (e.g., survival rates based on gender, class, or age).
   - Create visual plots like histograms, box plots, and heatmaps to understand feature distributions and correlations.

2. **Feature Selection**:
   - Identify the most important features for your analysis or model by evaluating feature importance, correlations, or using statistical methods.
   - Consider dimensionality reduction techniques like PCA if dealing with high-dimensional data.

3. **Modeling**:
   - Use the cleaned and prepared dataset to build predictive models using libraries such as Scikit-learn.
   - Try different algorithms like logistic regression, decision trees, or random forests to predict survival.

4. **Evaluation**:
   - **Split the dataset into training and testing sets**: This helps assess how well the model generalizes to unseen data. A common split is 80% of the data for training and 20% for testing.
   - **Evaluate model performance**: After training, evaluate the model on the test data using metrics such as:
     - **Accuracy**: the proportion of correct predictions out of all predictions.
     - **Precision**: the proportion of true positives out of all positive predictions.
     - **Recall**: the proportion of true positives out of all actual positives.
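
To make the split and the metric definitions concrete, here is a small sketch in plain Python. The labels and predictions are made up for illustration; in practice, Scikit-learn's `train_test_split`, `accuracy_score`, `precision_score`, and `recall_score` cover the same ground.

```python
import random

# Hypothetical row indices, shuffled and split 80% / 20%
indices = list(range(10))
random.Random(0).shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

# Hypothetical labels and predictions: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)   # correct predictions / all predictions
precision = tp / (tp + fp)          # true positives / predicted positives
recall = tp / (tp + fn)             # true positives / actual positives

print(len(train_idx), len(test_idx))
print(accuracy, precision, recall)
```

The three metrics differ on the same predictions, which is why it is worth reporting more than accuracy alone.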


<div style="background-color: #002147; color: #fff; padding: 30px; text-align: center;">
    <h1>THANK YOU!
</h1> </div>

<div style="background-color: lightgreen; color: black; padding: 10px; ">
    <h4>Solution
</h4> </div>
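
One possible solution sketch for the Live Exercise is shown below. It is not the only valid answer. The dataset is built in memory from made-up values so the code runs without a file (in the exercise you would load your own data with `pd.read_csv`); the column names, the median fill, the 1.5 × IQR rule, and the output file name are all assumptions chosen for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical dataset; replace with pd.read_csv(...) for a real file
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara", "Dev", "Eli"],
    "city": [" Delhi", "delhi", "Delhi ", "MUMBAI", "Mumbai ", "Pune"],
    "age": [25.0, 31.0, 31.0, np.nan, 29.0, 27.0],
    "income": [40.0, 42.0, 42.0, 45.0, 400.0, 41.0],
})

# Explore the data
print(raw.head())
raw.info()
print(raw.describe())

# Identify and handle missing values (median fill is one option)
print(raw.isnull().sum())
clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# Standardize text formats: trim whitespace and lowercase
clean["city"] = clean["city"].str.strip().str.lower()

# Remove duplicate rows
clean = clean.drop_duplicates()

# Detect and remove outliers with the 1.5 * IQR rule
q1 = clean["income"].quantile(0.25)
q3 = clean["income"].quantile(0.75)
iqr = q3 - q1
in_range = clean["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range]

# Take a reproducible random sample of the cleaned data
sample = clean.sample(n=3, random_state=0)
print(sample)

# Save the cleaned data to a new CSV file (temporary folder used here)
out_path = os.path.join(tempfile.mkdtemp(), "cleaned_data.csv")
clean.to_csv(out_path, index=False)
```

In this sketch the text columns are standardized before duplicates are removed, because rows that differ only in spacing or letter case would otherwise not be recognized as duplicates.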