# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [101]:
# Import pandas and any other libraries you need here.

import pandas as pd
import numpy as np

# Create a new dataframe from your CSV

reviews = pd.read_csv('ecommerce.csv')

In [102]:
# Print out any information you need to understand your dataframe

reviews.head()

,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [103]:
reviews.shape

(23486, 11)

## Missing Data

Try out different methods to locate and resolve missing data.

In [104]:
# Try to find some missing data!

reviews.isna().sum()


Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

In [105]:
reviews['Recommended IND'].value_counts(dropna=False)

Recommended IND
1    19314
0     4172
Name: count, dtype: int64

In [112]:
cols = {"Title": "empty", "Review Text": "empty"}
reviews_nonull = reviews.fillna(value=cols)
reviews_nonull = reviews_nonull.dropna()

In [113]:
reviews_nonull.isna().sum()


Unnamed: 0                 0
Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64

Did you find any missing data? What things worked well for you and what did not?

There are missing values in a few different columns: two that I probably don't care about ("Title" and "Review Text") and three that I probably do ("Division Name", "Department Name", and "Class Name"). They were easy to find using isna().sum().

I then replaced the null values in "Title" and "Review Text" with the word "empty". I am not reading those fields, so I don't want to lose the rest of the data in those rows.

Finally, I dropped the remaining rows that had nulls, all of which were in the "Division Name", "Department Name", and "Class Name" columns. They were comparatively few in number (14 in each column, versus 23k+ total reviews), and replacing them with placeholders seemed inappropriate since those columns might be ones we want to analyze.
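
As a small, self-contained sketch of the fill-then-drop approach described above, the cell below uses a made-up toy dataframe rather than the real reviews data. The `subset` argument to `dropna()` is an optional variation, not what the cell above does: it limits the drop to the columns that matter, which would protect rows if other columns ever gained nulls.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the reviews dataframe; all values are made up for illustration.
toy = pd.DataFrame({
    "Title": ["Great", np.nan, "Okay", np.nan],
    "Review Text": ["Love it", "Fits well", np.nan, "Too small"],
    "Division Name": ["General", "General", np.nan, "General Petite"],
})

# Fill the free-text columns with a placeholder so those rows are kept.
placeholders = {"Title": "empty", "Review Text": "empty"}
toy_filled = toy.fillna(value=placeholders)

# Drop only the rows that are missing a category we intend to analyze.
toy_clean = toy_filled.dropna(subset=["Division Name"])

print(toy_clean.isna().sum())
print(toy_clean.shape)
```

Only the row with a missing "Division Name" is removed here; the rows with missing text fields survive with the placeholder.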

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [108]:
# Keep an eye out for outliers!

reviews_nonull["Positive Feedback Count"].describe()


count    23486.000000
mean         2.535936
std          5.702202
min          0.000000
25%          0.000000
50%          1.000000
75%          3.000000
max        122.000000
Name: Positive Feedback Count, dtype: float64

What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

I used describe() on the numerical columns ("Rating", "Age", "Recommended IND", "Positive Feedback Count") and did not find anything that seemed particularly likely to be an outlier. Ratings were all in the 0-5 range, and while the range of "Positive Feedback Count" was pretty broad, it didn't seem unreasonably so.
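
As a complement to eyeballing describe(), one common rule of thumb flags values more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies it to a made-up series, not the real column, and the 1.5 multiplier is a convention rather than anything required by the exercise. With the quartiles shown above (0 and 3), the upper fence would be 3 + 1.5 × 3 = 7.5, so this rule would flag the larger feedback counts; whether those are errors or just popular reviews is still a judgment call.

```python
import pandas as pd

# Made-up counts, loosely shaped like a right-skewed feedback column.
counts = pd.Series([0, 0, 0, 1, 1, 2, 3, 3, 4, 122], name="Positive Feedback Count")

# Quartiles and interquartile range.
q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences on either side of the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are candidates for a closer look, not automatic errors.
flagged = counts[(counts < lower_fence) | (counts > upper_fence)]
print(flagged)
```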

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [109]:
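# Look out for unnecessary data!
# Note: drop() returns a new dataframe and leaves reviews_nonull unchanged;
# assign the result to a variable to keep the trimmed version.
reviews_nonull.drop(columns=["Title", "Review Text", "Division Name", "Unnamed: 0"])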

,Clothing ID,Age,Rating,Recommended IND,Positive Feedback Count,Department Name,Class Name
0,767,33,4,1,0,Intimate,Intimates
1,1080,34,5,1,4,Dresses,Dresses
2,1077,60,3,0,0,Dresses,Dresses
3,1049,50,5,1,0,Bottoms,Pants
4,847,47,5,1,6,Tops,Blouses
...,...,...,...,...,...,...,...
23481,1104,34,5,1,0,Dresses,Dresses
23482,862,48,3,1,0,Tops,Knits
23483,1104,31,3,0,1,Dresses,Dresses
23484,1084,28,3,1,2,Dresses,Dresses


Did you find any unnecessary data in your dataset? How did you handle it?

The "Title" and "Review Text" columns seem unnecessary for quantitative analysis. "Unnamed: 0" appears to duplicate the index and can also be dropped. "Division Name" can probably be dropped too, as it is very general and unlikely to be illuminating. Depending on what we're looking at, "Age" might also be a candidate to cut, along with either "Department Name" or "Class Name" (depending on whether we want more or less specificity). Since these are all whole columns, they are easy to drop.
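
Before dropping a column like "Unnamed: 0" on the assumption that it repeats the index, it may be worth confirming that it really does. The sketch below does that on a made-up toy dataframe; the names `toy`, `to_drop`, and `toy_trimmed` are illustrative only.

```python
import pandas as pd

# Toy stand-in with a column that mirrors the row index; values are made up.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Clothing ID": [767, 1080, 1077],
    "Title": ["a", "b", "c"],
    "Rating": [4, 5, 3],
})

# True when every value in the column equals the row index at that position.
duplicates_index = (toy["Unnamed: 0"] == toy.index).all()

# Columns we have decided are not needed for a quantitative analysis.
to_drop = ["Title"]
if duplicates_index:
    to_drop.append("Unnamed: 0")

# drop() returns a new dataframe, so the result is assigned to keep it.
toy_trimmed = toy.drop(columns=to_drop)
print(toy_trimmed.columns.tolist())
```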

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [110]:
# Look out for inconsistent data!


Did you find any inconsistent data? What did you do to clean it?

I did not find any data that appeared inconsistent. 
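
A quick way to double-check a conclusion like this is to list the distinct labels in each categorical column and look for near-duplicates. The sketch below uses made-up labels, not the real column. The head() output near the top of the notebook shows a "Division Name" value spelled "Initmates", which might be worth a look with this kind of check. Whether that spelling is a mistake or the dataset's intended label is an assumption here, and the `corrections` mapping is hypothetical.

```python
import pandas as pd

# Made-up labels showing stray whitespace, inconsistent case, and a possible misspelling.
labels = pd.Series(
    ["General", "general ", "Initmates", "Intimates", "General Petite"],
    name="Division Name",
)

# Listing each distinct label with its count makes near-duplicates easy to spot.
print(labels.value_counts())

# Normalize whitespace and capitalization first.
normalized = labels.str.strip().str.title()

# Hypothetical correction mapping; only apply one after confirming the intended label.
corrections = {"Initmates": "Intimates"}
cleaned = normalized.replace(corrections)

print(cleaned.unique())
```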