# Introduction to Data Cleaning
This notebook walks through cleaning and preparing the weather dataset for analysis, from loading the raw file to saving a tidy result.



## Step 1: Load and Explore the Dataset
### Task:
- Import necessary libraries (e.g., pandas).
- Load the weather dataset into a DataFrame.
- Explore the structure and summary statistics of the dataset.


In [16]:
import pandas as pd
import numpy as np

path = "weather-raw.csv"
df = pd.read_csv(path)

df.head(20)

Unnamed: 0,id,year,month,element,d1,d2,d3,d4,d5,d6,...,d22,d23,d24,d25,d26,d27,d28,d29,d30,d31
0,MX17004,2010,1,tmax,,,,,,,...,,,,,,,,,27.8,
1,MX17004,2010,1,tmin,,,,,,,...,,,,,,,,,14.5,
2,MX17004,2010,2,tmax,,27.3,24.1,,,,...,,29.9,,,,,,,,
3,MX17004,2010,2,tmin,,14.4,14.4,,,,...,,10.7,,,,,,,,
4,MX17004,2010,3,tmax,,,,,32.1,,...,,,,,,,,,,
5,MX17004,2010,3,tmin,,,,,14.2,,...,,,,,,,,,,
6,MX17004,2010,4,tmax,,,,,,,...,,,,,,36.3,,,,
7,MX17004,2010,4,tmin,,,,,,,...,,,,,,16.7,,,,
8,MX17004,2010,5,tmax,,,,,,,...,,,,,,33.2,,,,
9,MX17004,2010,5,tmin,,,,,,,...,,,,,,18.2,,,,
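
The task also asks for the structure and summary statistics, which `head` alone does not show. A minimal sketch of those checks follows. It assumes a small hand-built frame named `sample` (a hypothetical stand-in that keeps only three of the day columns, with values copied from the rows shown above) rather than the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for weather-raw.csv: same layout, only three day columns.
sample = pd.DataFrame({
    "id": ["MX17004"] * 4,
    "year": [2010] * 4,
    "month": [1, 1, 2, 2],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "d2": [np.nan, np.nan, 27.3, 14.4],
    "d3": [np.nan, np.nan, 24.1, 14.4],
    "d30": [27.8, 14.5, np.nan, np.nan],
})

print(sample.shape)        # (rows, columns)
sample.info()              # column names, non-null counts, dtypes
print(sample.describe())   # summary statistics for the numeric columns
print(sample["element"].value_counts())  # which measurements are present
```

On the full dataset the same calls would be made on `df`; the non-null counts from `info()` give a first impression of how sparse the day columns are.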



## Step 2: Handle Missing Values
### Task:
- Identify missing values in the dataset.
- Apply appropriate methods to handle these missing values (e.g., imputation or removal).


In [18]:
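print(df.isnull().sum())  # print explicitly: only the last expression in a cell is displayed
df_na = df.fillna(0)  # caution: 0 is also a valid temperature, so filled cells look like real readings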
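
Whether filling with 0 is appropriate depends on the analysis, because a missing reading then counts as a temperature of 0. The sketch below illustrates the effect on a mean, assuming a hypothetical two-row `sample` in the same wide layout (values taken from the rows shown in Step 1). It is meant as one way to weigh the options, not as the required approach.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same wide layout as the raw data.
sample = pd.DataFrame({
    "element": ["tmax", "tmax"],
    "d2": [np.nan, 27.3],
    "d3": [np.nan, 24.1],
    "d30": [27.8, np.nan],
})
day_cols = ["d2", "d3", "d30"]

# Count missing values per column and as a share of all day cells.
missing_per_column = sample[day_cols].isna().sum()
missing_share = sample[day_cols].isna().to_numpy().mean()
print(missing_per_column)
print(f"share of missing cells: {missing_share:.0%}")

# Filling with 0 pulls the mean toward zero; keeping NaN lets the gaps be skipped.
mean_with_zero = sample[day_cols].fillna(0).to_numpy().mean()
mean_skipping_nan = np.nanmean(sample[day_cols].to_numpy())
print(mean_with_zero, mean_skipping_nan)
```

Here half of the cells are missing, so the zero-filled mean is half of the mean computed over the actual readings.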



## Step 3: Correct Data Types
### Task:
- Inspect the data types of each column.
- Convert columns to their appropriate data types (e.g., dates, numeric values).


In [30]:

df_na["year_period"] = df_na["year"].astype(str).astype("period[Y]")
df_na.dtypes


id                    object
year                   int64
month                  int64
element               object
d1                   float64
d2                   float64
d3                   float64
d4                   float64
d5                   float64
d6                   float64
d7                   float64
d8                   float64
d9                   float64
d10                  float64
d11                  float64
d12                  float64
d13                  float64
d14                  float64
d15                  float64
d16                  float64
d17                  float64
d18                  float64
d19                  float64
d20                  float64
d21                  float64
d22                  float64
d23                  float64
d24                  float64
d25                  float64
d26                  float64
d27                  float64
d28                  float64
d29                  float64
d30                  float64
d31                  float64

In [29]:
df_na.head()

Unnamed: 0,id,year,month,element,d1,d2,d3,d4,d5,d6,...,d23,d24,d25,d26,d27,d28,d29,d30,d31,year_period
0,MX17004,2010,1,tmax,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.8,0.0,2010
1,MX17004,2010,1,tmin,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.5,0.0,2010
2,MX17004,2010,2,tmax,0.0,27.3,24.1,0.0,0.0,0.0,...,29.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2010
3,MX17004,2010,2,tmin,0.0,14.4,14.4,0.0,0.0,0.0,...,10.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2010
4,MX17004,2010,3,tmax,0.0,0.0,0.0,0.0,32.1,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2010
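
Beyond the yearly period, two other conversions may be worth considering: storing `element` as a categorical, and combining `year` and `month` into a single monthly period. The sketch below assumes a hypothetical `sample` with only the identifier columns, and the column name `month_period` is an illustrative choice.

```python
import pandas as pd

# Hypothetical sample with the identifier columns from the dataset.
sample = pd.DataFrame({
    "id": ["MX17004", "MX17004"],
    "year": [2010, 2010],
    "month": [1, 2],
    "element": ["tmax", "tmin"],
})

# A repeated label such as tmax/tmin is a natural fit for the category dtype.
sample["element"] = sample["element"].astype("category")

# Combine year and month into one monthly period (day=1 is only a placeholder).
first_of_month = pd.to_datetime(sample[["year", "month"]].assign(day=1))
sample["month_period"] = first_of_month.dt.to_period("M")

print(sample.dtypes)
```

Full dates cannot be built at this stage because the day is still encoded in the column names `d1` to `d31`; that becomes possible once the data is reshaped.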



## Step 4: Tidy the Dataset
### Task:
- Reshape the dataset as needed to ensure it adheres to tidy data principles.
- Use techniques such as melting or pivoting to organize variables into columns.
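
In the raw layout, the day columns `d1` to `d31` hold values of a variable (the day), and the `element` column holds variable names (`tmax`, `tmin`). One possible reshaping is sketched below: melt the day columns into rows, build a date, then spread `element` into columns. It assumes a hypothetical `sample` with three day columns; the names `long`, `tidy`, `day`, `value` and `date` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw data (values from the rows shown in Step 1).
sample = pd.DataFrame({
    "id": ["MX17004"] * 4,
    "year": [2010] * 4,
    "month": [1, 1, 2, 2],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "d2": [np.nan, np.nan, 27.3, 14.4],
    "d3": [np.nan, np.nan, 24.1, 14.4],
    "d30": [27.8, 14.5, np.nan, np.nan],
})

# 1. Melt: one row per (station, month, element, day) instead of one column per day.
long = sample.melt(
    id_vars=["id", "year", "month", "element"],
    var_name="day",
    value_name="value",
).dropna(subset=["value"])

# 2. Turn "d30" into 30 and build a real date from year, month, and day.
long["day"] = long["day"].str[1:].astype(int)
long["date"] = pd.to_datetime(long[["year", "month", "day"]])

# 3. Pivot: tmax and tmin are variables, so each gets its own column.
tidy = (
    long.set_index(["id", "date", "element"])["value"]
    .unstack("element")
    .reset_index()
    .rename_axis(columns=None)
    .sort_values("date")
    .reset_index(drop=True)
)
print(tidy)
```

This sketch starts from the unfilled data. If the missing cells had been replaced by 0 first, `dropna` would keep them, and impossible dates such as 30 February would reach `pd.to_datetime` and raise an error.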



## Step 5: Validate and Save
### Task:
- Ensure that your cleaned dataset is free of inconsistencies.
- Save the cleaned dataset as `cleaned_weather.csv`.
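
What counts as an inconsistency depends on the dataset. The sketch below shows three checks that seem reasonable for daily temperature data (duplicate station/date rows, missing values, and `tmin` above `tmax`), then saves the result. The `tidy` frame is a hypothetical example of a cleaned dataset with one row per station and date, and the specific checks are assumptions rather than a required list.

```python
import pandas as pd

# Hypothetical tidy result: one row per station and date.
tidy = pd.DataFrame({
    "id": ["MX17004"] * 3,
    "date": pd.to_datetime(["2010-01-30", "2010-02-02", "2010-02-03"]),
    "tmax": [27.8, 27.3, 24.1],
    "tmin": [14.5, 14.4, 14.4],
})

# Collect every failed check instead of stopping at the first one.
problems = []
if tidy.duplicated(subset=["id", "date"]).any():
    problems.append("duplicate station/date rows")
if tidy[["tmax", "tmin"]].isna().any().any():
    problems.append("missing temperature values")
if (tidy["tmin"] > tidy["tmax"]).any():
    problems.append("tmin greater than tmax")

if problems:
    raise ValueError("; ".join(problems))

# index=False avoids writing the row index as an extra unnamed column.
tidy.to_csv("cleaned_weather.csv", index=False)
```

Reading the file back with `pd.read_csv("cleaned_weather.csv", parse_dates=["date"])` is a quick way to confirm that the saved columns and dtypes match what was intended.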



## Step 6: Bonus Task - Outlier Detection
### Task:
- Identify any outliers in the cleaned dataset.
- Create a separate DataFrame containing these outliers and save it as `outliers.csv`.
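
The task does not prescribe a detection method. One common choice is the interquartile range (IQR) rule, which flags values more than 1.5 × IQR below the first quartile or above the third. The sketch below assumes that rule; the helper `iqr_outliers` is hypothetical, and the last row of the example data is an artificial value added only so that the check has something to find.

```python
import pandas as pd

# tmax readings from the rows shown in Step 1, plus one artificial row (60.0)
# injected purely for illustration; it is not part of the dataset.
tidy = pd.DataFrame({
    "date": pd.to_datetime([
        "2010-01-30", "2010-02-02", "2010-02-03", "2010-02-23",
        "2010-03-05", "2010-04-27", "2010-05-27", "2010-06-01",
    ]),
    "tmax": [27.8, 27.3, 24.1, 29.9, 32.1, 36.3, 33.2, 60.0],
})


def iqr_outliers(df, column, k=1.5):
    """Return rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df[column] < q1 - k * iqr) | (df[column] > q3 + k * iqr)
    return df[mask]


outliers = iqr_outliers(tidy, "tmax")
outliers.to_csv("outliers.csv", index=False)
print(outliers)
```

With only a handful of readings the quartiles are rough estimates, so the flagged rows are better treated as candidates for inspection than as confirmed errors.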
