# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [3]:
# Import pandas and any other libraries you need here.
import pandas as pd
import numpy as np
# Create a new dataframe from your CSV
data = pd.read_csv(r"C:\Users\lred1\Desktop\Launchcode\SQL\clone\data-analysis-projects\cleaning-data-with-pandas\exercises\Womens Clothing E-Commerce Reviews.csv")

In [4]:
# Print out any information you need to understand your dataframe
data.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
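
Beyond `head()`, a few other calls give a quick picture of a dataframe. The sketch below is only an illustration: `sample` is a tiny hypothetical frame modeled on the first three rows above, so it runs without the CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reviews data, used only for illustration.
sample = pd.DataFrame({
    "Age": [33, 34, 60],
    "Title": [np.nan, np.nan, "Some major design flaws"],
    "Rating": [4, 5, 3],
})

n_rows, n_cols = sample.shape      # number of rows and columns
column_types = sample.dtypes       # data type of each column
non_null_counts = sample.count()   # non-missing values per column

print(n_rows, n_cols)
print(column_types)
print(non_null_counts)
sample.info()                      # prints a combined summary of the above
```

On the real data, the same calls would be made on `data` instead of `sample`.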


## Missing Data

Try out different methods to locate and resolve missing data.

In [10]:
# Try to find some missing data!
missing_cells = pd.isnull(data).sum()
total_missing = missing_cells.sum()

print("missing_cells", total_missing )

missing_title_rows = data[data['Title'].isnull()]
print("Missing Title count:", len(missing_title_rows))


missing_cells 4697
Missing Title count: 3810


Did you find any missing data? What things worked well for you and what did not?

In [None]:
# Respond to the above questions here:
Yes, I found missing data in the CSV file.
There were 4,697 missing cells in total, and the Title column alone accounts for 3,810 of them.
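
A per-column breakdown often says more than a single total, and there is more than one way to resolve the gaps. The following is a minimal sketch on a hypothetical `sample` frame. The placeholder label `"No Title"` and the choice of which rows to drop are assumptions, not steps taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real dataframe is `data`, loaded from the CSV.
sample = pd.DataFrame({
    "Title": [np.nan, "My favorite buy!", np.nan, "Flattering shirt"],
    "Review Text": ["Love it", np.nan, "Runs small", "Very flattering"],
    "Rating": [4, 5, 3, 5],
})

# Missing values per column, largest first.
missing_per_column = sample.isnull().sum().sort_values(ascending=False)
print(missing_per_column)

# Option 1: fill missing titles with a placeholder (the label is an assumption).
filled = sample.copy()
filled["Title"] = filled["Title"].fillna("No Title")

# Option 2: drop rows that have no review text at all.
dropped = sample.dropna(subset=["Review Text"])
print(len(dropped))
```

Filling keeps every row, while dropping shrinks the dataset; which one fits depends on whether the column matters for the analysis.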

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [13]:
# Keep an eye out for outliers!
1. Positive Feedback Count
Most values are between 0 and 3.
The maximum value is 122.

2. Age
Most customers are between 34 and 52 years old.
The maximum age is 99.

What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here:
1. Used select_dtypes() to identify the numeric columns.
This narrowed the search to the columns where outliers are meaningful (Age, Rating, Positive Feedback Count).

2. Used describe() for summary statistics.
The describe() method reports the min, max, mean, standard deviation, and quartiles (25%, 50%, 75%), which makes it easy to see when a maximum sits far above the 75% quartile.
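
The two steps in these notes can be sketched as follows, with one addition: the 1.5 * IQR rule, a common convention for flagging outliers that was not part of the original notes. The `sample` values are hypothetical and chosen so that each numeric column has one extreme value.

```python
import pandas as pd

# Hypothetical sample with one extreme value in each numeric column.
sample = pd.DataFrame({
    "Age": [34, 39, 41, 45, 52, 99],
    "Positive Feedback Count": [0, 0, 1, 2, 3, 122],
    "Division Name": ["General"] * 6,
})

# Step 1: keep only the numeric columns.
numeric = sample.select_dtypes(include="number")

# Step 2: summary statistics (min, max, mean, std, quartiles).
print(numeric.describe())

# Added step (assumption): flag values more than 1.5 * IQR beyond the quartiles.
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()
print(outlier_counts)
```

A flagged value is not automatically an error. An age of 99 or a review with 122 positive votes may be genuine, so the mask is a prompt for inspection rather than a deletion list.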

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [None]:
# Look out for unnecessary data!
The "Unnamed: 0" column is not useful because it does not provide any meaningful information for analysis.

Did you find any unnecessary data in your dataset? How did you handle it?

In [19]:
# Make your notes here.
# Yes, I found the "Unnamed: 0" column and dropped it.
data = data.drop(columns=["Unnamed: 0"])
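
Before dropping a column like this, it can be worth confirming that it only repeats the row index. The sketch below does that on a hypothetical `sample` frame and also counts duplicate rows, which is a further check not performed in the cell above.

```python
import pandas as pd

# Hypothetical sample that mimics an index column saved into the CSV.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Clothing ID": [767, 1080, 1077],
    "Rating": [4, 5, 3],
})

# Find every column whose name starts with "Unnamed".
unnamed_cols = [col for col in sample.columns if col.startswith("Unnamed")]

# Confirm the column only repeats the row index before dropping it.
repeats_index = (sample["Unnamed: 0"] == sample.index).all()

cleaned = sample.drop(columns=unnamed_cols)

# Duplicate rows are another kind of unnecessary data worth counting.
duplicate_rows = cleaned.duplicated().sum()

print(unnamed_cols, repeats_index, duplicate_rows)
```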



## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [22]:
# Look out for inconsistent data!
# 1. Count titles with extra spaces (leading or trailing)
# Note: NaN != NaN is True, so missing titles are counted here as well
extra_spaces = (data['Title'] != data['Title'].str.strip()).sum()
print("Extra spaces in Title:", extra_spaces)

# 2. Count titles that are not entirely lowercase (missing titles are counted too)
mixed_caps = (data['Title'] != data['Title'].str.lower()).sum()
print("Mixed capitalization in Title:", mixed_caps)

# 3. Count missing values in Title
missing_titles = data['Title'].isna().sum()
print("Missing Title values:", missing_titles)


Extra spaces in Title: 3810
Mixed capitalization in Title: 23442
Missing Title values: 3810


Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here!
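Yes, I found some inconsistent data in the Title column: mixed capitalization and missing values.
The extra-spaces count (3,810) matches the missing-title count exactly because NaN != NaN evaluates to True, so that check flagged the missing titles rather than titles with stray whitespace.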
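
One way to keep missing values out of these counts is to drop them before comparing. The sketch below shows the difference on a hypothetical `sample` frame, then applies one possible cleanup. The choice of `str.capitalize()` as the target style is an assumption, not something the exercise prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical titles: one missing, one padded with spaces, one all lowercase.
sample = pd.DataFrame({
    "Title": [
        np.nan,
        "  Flattering shirt ",
        "my favorite buy!",
        "Some major design flaws",
    ],
})

# Naive check: the missing title is counted because NaN != NaN is True.
naive_extra_spaces = (sample["Title"] != sample["Title"].str.strip()).sum()

# NaN-aware check: compare only the titles that are present.
titles = sample["Title"].dropna()
extra_spaces = (titles != titles.str.strip()).sum()
not_lowercase = (titles != titles.str.lower()).sum()
print(naive_extra_spaces, extra_spaces, not_lowercase)

# One possible cleanup: trim whitespace and use a single capitalization style.
sample["Title"] = sample["Title"].str.strip().str.capitalize()
print(sample["Title"].tolist())
```

Missing titles stay missing after the cleanup, so they can still be handled separately.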