# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

**Dataset Information:**
- **Dataset Name:** Women's Clothing E-Commerce Reviews
- **File:** `Womens Clothing E-Commerce Reviews.csv`
- **Source:** This dataset contains reviews written by customers and includes features like ratings, review text, product categories, and customer information.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [1]:
# Import pandas and any other libraries you need here.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt


# Create a new dataframe from your CSV

df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")

In [2]:
# Print out any information you need to understand your dataframe
df.info()
#df_reviews = df.set_index("Clothing ID")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB
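The `Unnamed: 0` column in the output above usually appears when a CSV was saved with its row index and then read back without declaring it. A minimal sketch of that behavior, using a small in-memory CSV as a hypothetical stand-in for the real file (the `csv_text` contents are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: the first column has no header,
# which is how a saved DataFrame index appears in a CSV.
csv_text = ",Clothing ID,Age\n0,767,33\n1,1080,34\n"

# Default read: the unnamed first column becomes a regular column.
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
```

Whether to use `index_col=0` or to drop the column later is a matter of preference; both leave the same data columns.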


## Missing Data

Try out different methods to locate and resolve missing data.

In [3]:
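# Try to find some missing data!

# dropna() returns a new dataframe, so assign the result to keep it.
df = df.dropna(how="all")                  # drop rows where every value is missing
df = df.dropna(axis="columns", how="all")  # drop columns where every value is missing
df = df.fillna("unknown")                  # replace the remaining missing values

# Confirm that nothing is missing after the fill.
df.isnull().sum()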

Unnamed: 0                 0
Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64

Did you find any missing data? What things worked well for you and what did not?

In [4]:
# Respond to the above questions here: Yes, there was missing data (in Title, Review Text, Division Name,
# Department Name, and Class Name), but no row had all of its values missing.
# So I kept all the rows and replaced the missing values with "unknown".
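Because the cell above fills the gaps before counting them, the count shows all zeros. One way to see the gaps first is to count before filling. The sketch below uses a small hypothetical sample (the `sample` dataframe and its values are made up) rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same kind of gaps as the reviews data.
sample = pd.DataFrame({
    "Title": ["Great fit", np.nan, "Runs small", np.nan],
    "Rating": [5, 4, 3, 5],
    "Class Name": ["Dresses", "Knits", np.nan, "Pants"],
})

# Count missing values per column BEFORE filling anything.
missing_counts = sample.isnull().sum()
missing_share = sample.isnull().mean()  # fraction of rows missing per column

# Fill only the text columns, leaving numeric columns untouched.
text_cols = ["Title", "Class Name"]
filled = sample.copy()
filled[text_cols] = filled[text_cols].fillna("unknown")

print(missing_counts)
print(filled.isnull().sum().sum())  # total missing values left after the fill
```

Restricting the fill to text columns is a design choice for this sketch: it avoids putting the string "unknown" into a numeric column if one ever has gaps.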

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [5]:
# Keep an eye out for outliers!

df.describe()
# df.plot.hist(column="Positive Feedback Count")

| | Unnamed: 0 | Clothing ID | Age | Rating | Recommended IND | Positive Feedback Count |
|---|---|---|---|---|---|---|
| count | 23486.0 | 23486.0 | 23486.0 | 23486.0 | 23486.0 | 23486.0 |
| mean | 11742.5 | 918.118709 | 43.198544 | 4.196032 | 0.822362 | 2.535936 |
| std | 6779.968547 | 203.29898 | 12.279544 | 1.110031 | 0.382216 | 5.702202 |
| min | 0.0 | 0.0 | 18.0 | 1.0 | 0.0 | 0.0 |
| 25% | 5871.25 | 861.0 | 34.0 | 4.0 | 1.0 | 0.0 |
| 50% | 11742.5 | 936.0 | 41.0 | 5.0 | 1.0 | 1.0 |
| 75% | 17613.75 | 1078.0 | 52.0 | 5.0 | 1.0 | 3.0 |
| max | 23485.0 | 1205.0 | 99.0 | 5.0 | 1.0 | 122.0 |


What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here:
# The only apparent outlier is the maximum Positive Feedback Count of 122, but that could be a valid value.
# Therefore I kept the data as it is.
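`describe()` surfaces extreme values by eye. A common complementary check is the interquartile range (IQR) rule, which flags values more than 1.5 × IQR beyond the quartiles. This is a sketch of that rule on hypothetical counts (the `counts` series is made up), not the method used in the cell above:

```python
import pandas as pd

# Hypothetical feedback counts; the real column has 23,486 values.
counts = pd.Series([0, 0, 1, 1, 2, 3, 3, 4, 122], name="Positive Feedback Count")

# Quartiles and interquartile range.
q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1

# Values outside these fences are flagged as candidate outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = counts[(counts < lower_fence) | (counts > upper_fence)]

print(outliers)
```

Applied to the quartiles reported by `describe()` for Positive Feedback Count (25% = 0, 75% = 3), the upper fence would be 3 + 1.5 × 3 = 7.5, so this rule would flag every count above 7, not only the maximum of 122. Flagged values are candidates to review, not necessarily errors.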

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [15]:
# Look out for unnecessary data!

df = df.drop(columns=["Unnamed: 0", "Title", "Review Text"])
df.fillna("unknown", inplace=True)
df.isnull().sum()


Clothing ID                0
Age                        0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64

In [56]:
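# Count fully duplicated rows, then drop them in place.
# Only the last expression in a cell is displayed, so wrap the count in print() to see it.
df.duplicated().sum()
df.drop_duplicates(inplace=True)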

Did you find any unnecessary data in your dataset? How did you handle it?

In [61]:
df

| | Clothing ID | Age | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|
| 0 | 767 | 33 | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1080 | 34 | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 1077 | 60 | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 1049 | 50 | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 847 | 47 | 5 | 1 | 6 | General | Tops | Blouses |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23481 | 1104 | 34 | 5 | 1 | 0 | General Petite | Dresses | Dresses |
| 23482 | 862 | 48 | 3 | 1 | 0 | General Petite | Tops | Knits |
| 23483 | 1104 | 31 | 3 | 0 | 1 | General Petite | Dresses | Dresses |
| 23484 | 1084 | 28 | 3 | 1 | 2 | General | Dresses | Dresses |


In [None]:
# Make your notes here.
# The Unnamed: 0, Title, and Review Text columns are not relevant to my analysis, so I dropped them.
# I also dropped the duplicated rows.
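Before dropping duplicates, it can help to look at the rows that would be removed. Once the text columns are gone, two different reviews may share the same remaining values and so count as duplicates. A minimal sketch on hypothetical rows (the `sample` dataframe is made up):

```python
import pandas as pd

# Hypothetical rows shaped like the reduced dataframe.
sample = pd.DataFrame({
    "Clothing ID": [1078, 1078, 862, 1078],
    "Age": [34, 34, 48, 34],
    "Rating": [5, 5, 3, 4],
})

# keep=False marks every member of a duplicate group, which makes them easy to inspect.
dupes = sample[sample.duplicated(keep=False)]

# With the default keep="first", this counts the rows drop_duplicates() would remove.
n_extra = sample.duplicated().sum()

# drop_duplicates() without inplace=True returns a new dataframe.
deduped = sample.drop_duplicates()

print(dupes)
print(len(sample), n_extra, len(deduped))
```

Whether such rows are truly redundant depends on the analysis; checking for duplicates before dropping the review text is one alternative ordering.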

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [94]:
# Look out for inconsistent data!
df["Clothing ID"].value_counts()

Clothing ID
1078    677
862     558
1094    540
1081    433
829     430
       ... 
224       1
544       1
657       1
310       1
1069      1
Name: count, Length: 1206, dtype: int64

In [95]:
df.head()

# df['Division Name'].unique()
# df['Department Name'].unique()
df['Class Name'].unique()

array(['Intimates', 'Dresses', 'Pants', 'Blouses', 'Knits', 'Outerwear',
       'Lounge', 'Sweaters', 'Skirts', 'Fine gauge', 'Sleep', 'Jackets',
       'Swim', 'Trend', 'Jeans', 'Legwear', 'Shorts', 'Layering',
       'Casual bottoms', 'unknown', 'Chemises'], dtype=object)

Did you find any inconsistent data? What did you do to clean it?

In [96]:
# Make your notes here!
# I checked value counts and unique values (output shown above for Class Name) but did not find any inconsistencies.
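Inconsistent labels are often easier to spot after normalizing whitespace and capitalization and then counting values. The sketch below uses hypothetical values (the `division` series is made up), except for `Initmates`, which appears in the Division Name column of the dataframe output earlier and looks like a possible misspelling. Mapping it to `Intimates` is an assumption, not something the dataset confirms.

```python
import pandas as pd

# Hypothetical values; "Initmates" is copied from the dataframe output shown earlier.
division = pd.Series(
    ["Initmates", "General", " general", "General Petite", "GENERAL"],
    name="Division Name",
)

# Normalize whitespace and capitalization so variants collapse to one label.
cleaned = division.str.strip().str.title()

# Map a suspected misspelling to a corrected label (the correction is an assumption).
cleaned = cleaned.replace({"Initmates": "Intimates"})

print(cleaned.value_counts())
```

Note that `str.title()` would also change labels such as `Fine gauge` and `Casual bottoms` in Class Name, so `str.strip()` alone may be the safer choice for that column.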