# Preparing the Women's Clothing E-Commerce Review Dataset
_by Nick Brooks, January 2018_

- [**Github**](https://github.com/nicapotato)
- [**Kaggle**](https://www.kaggle.com/nicapotato/)
- [**Linkedin**](https://www.linkedin.com/in/nickbrooks7)

In [1]:
# Packages
import pandas as pd

In [2]:
# Processing
data = pd.ExcelFile("Women's Retail/retail dataset.xlsx")
df = data.parse("Sheet1", header=1)
df = df.drop(["Unnamed: 18", "Unnamed: 19", "Unnamed: 20", "Unnamed: 21","Unnamed: 22","Unnamed: 23","Unnamed: 24","Unnamed: 25","Unnamed: 26"], axis = 1)
df = df.drop(["REVIEW_ID", "RATING_RANGE", "NUM_NEGATIVE_FEEDBACKS"], axis = 1)

# Anonymize Categorical Feature Values
df["DIVISION_NAME"] = df["DIVISION_NAME"].replace({
    'ANTHRO  INTIMATES (NA)': "Intimates",
    "ANTHRO. WOMEN'S DIVISION (NA)": "General",
    "ANTHRO. MISC. DIVISION (NA)": "General Petite"})
# regex=True is explicit: pandas 2.0+ treats the pattern as a literal string otherwise
df["DEPARTMENT_NAME"] = df["DEPARTMENT_NAME"].str.replace(
    r"(WOMEN'S)|(NA)|[()]|(ANTHRO.)", '', regex=True).str.strip().str.lower().str.capitalize()
df["CLASS_NAME"] = df["CLASS_NAME"].str.replace(
    r"[-]|(NA)|[()]|(ANTHRO.)", '', regex=True).str.strip().str.lower().str.capitalize()

df.columns = ["Clothing ID", "Age", "IDD","Title","Review Text","Rating",
              "Recommended IND","Feedback Count","Positive Feedback Count","Division ID","Division Name",
              "Department ID","Department Name","Class ID","Class Name"]

# Replace mentions of Anthropologie with "Retailer"
# (?i) leads the pattern: Python 3.11+ rejects inline global flags placed elsewhere
df.Title = df.Title.str.replace(
    r'(?i)anthropologie|anthropology|anthro',
    'Retailer', regex=True).str.strip().str.lower().str.capitalize()
df["Review Text"] = df["Review Text"].str.replace(
    r'(?i)anthropologie|anthropology|anthro|anthr',
    'Retailer', regex=True).str.strip().str.lower().str.capitalize()
# Couldn't figure out silver bullet
# https://stackoverflow.com/questions/39768547/replace-whole-string-if-it-contains-substring-in-pandas

**Code Explanation:** <br>
The Read Data section uses a pandas function to read and parse the first sheet of the Excel-formatted retail dataset. Then the blank columns are removed, as well as the unused variables “REVIEW_ID”, “RATING_RANGE”, and “NUM_NEGATIVE_FEEDBACKS”. Finally, the Transform Dependent Variable to Binary section creates a new variable.

Next, the comments themselves must be cleaned and anonymized.
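
The anonymization step can be sketched with the standard `re` module. This is a minimal illustration on a made-up sentence, not the notebook's exact pandas call; the helper name `anonymize` and the sample text are assumptions. The case-insensitive flag is set once for the whole pattern, and the longer alternatives are listed first so that `anthro` does not match ahead of `anthropologie`.

```python
import re

# Longer alternatives come first: alternation takes the first branch that
# matches, so listing "anthro" first would leave "pologie" behind.
RETAILER_PATTERN = re.compile(r"anthropologie|anthropology|anthro", re.IGNORECASE)

def anonymize(text):
    """Replace retailer mentions (hypothetical helper, for illustration only)."""
    return RETAILER_PATTERN.sub("Retailer", text)

sample = "My fourth piece from Anthropologie. ANTHRO never disappoints!"
anonymized = anonymize(sample)
print(anonymized)  # My fourth piece from Retailer. Retailer never disappoints!
```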

**Regular Expressions:**
- replace() : Swaps every occurrence of each dictionary key with its corresponding dictionary value.
- str.strip() : Removes whitespace at the start and end of the text.
- str.lower() : Converts all of the text to lower case.
- str.capitalize() : Capitalizes the first character of the string and lower-cases the rest.
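
For a single string, Python's built-in `str` methods behave like the pandas `.str` counterparts listed above, so the chain can be tried without a dataframe. The sample title below is invented for illustration. It also shows a side effect worth knowing: because `capitalize()` lower-cases everything after the first character, an inserted "Retailer" ends up as "retailer" in the middle of a sentence.

```python
# Hypothetical review title with stray whitespace and mixed case.
raw_title = "  LOVE this Retailer dress  "

# Same chain as in the notebook: trim, lower-case, then capitalize.
cleaned_title = raw_title.strip().lower().capitalize()
print(cleaned_title)  # Love this retailer dress

# capitalize() already lower-cases the remainder, so lower() is redundant here.
same_result = raw_title.strip().capitalize() == cleaned_title
print(same_result)  # True
```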

**Regex Compilers:**
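- r" " : The "r" prefix marks a raw string literal, so backslashes (\) reach the regex engine as-is instead of being interpreted as Python escape sequences.
- r"(WOMEN'S)|(NA)|[()]|(ANTHRO.)" : Targets the substrings inside each pair of parentheses, which act as groups, while [()] is a character class that matches a literal bracket. The | is an OR operator, enabling the chaining of multiple alternatives. Note that the unescaped . matches any single character, not only a period. Finally, every match is replaced with "" ; nothing.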
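
The pattern above can be tested on its own with the standard `re` module. The sketch below is illustrative: the raw value `"ANTHRO. WOMEN'S TOPS (NA)"` and the helper name `clean_name` are assumptions, not values confirmed to be in the spreadsheet. The second call shows what the unescaped `.` does to text that merely starts with "ANTHRO".

```python
import re

# Same pattern as the DEPARTMENT_NAME cleanup in the notebook.
DEPARTMENT_PATTERN = r"(WOMEN'S)|(NA)|[()]|(ANTHRO.)"

def clean_name(raw):
    """Strip the matched fragments, then normalize case (illustrative helper)."""
    return re.sub(DEPARTMENT_PATTERN, "", raw).strip().lower().capitalize()

tops = clean_name("ANTHRO. WOMEN'S TOPS (NA)")
print(tops)  # Tops

# The "." in "ANTHRO." matches any character, so "ANTHROP" is removed too.
overmatch = clean_name("ANTHROPOLOGIE")
print(overmatch)  # Ologie
```

Escaping the period as `ANTHRO\.` would restrict the last alternative to the literal prefix.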

The last part of the code renames the columns into a cleaner format. In fact, I could have used a list comprehension and regular expressions for this part as well, but it was simpler to hard-code the names!

**Code Interpretation: Variables and Processing:** <br>
The first step is to address the variables and dive into the pre-processing steps necessary to turn raw text into valuable output. This dataset’s notable variables include: the review title and review body for a clothing product, the rating assigned to the product, the age of the customer, whether the product was recommended, and finally the department and division.

In order to facilitate sentiment analysis, a new boolean variable is created to categorize good and bad reviews. All reviews with a rating of 3 or higher were deemed good, and reviews under 3 were deemed bad. This step is especially important for the Naive Bayes supervised learning algorithm, since it requires a clear binary label to train on.
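
The thresholding rule can be sketched in plain Python. The function name `is_good_review` and the sample ratings are assumptions made for illustration; with pandas, a comparison on the rating column such as `df["Rating"] >= 3` would produce the same boolean values.

```python
def is_good_review(rating):
    """Return True for ratings of 3 or more, the threshold described above."""
    return rating >= 3

# Hypothetical ratings on the dataset's 1-to-5 scale.
ratings = [5, 3, 2, 1, 4]
labels = [is_good_review(r) for r in ratings]
print(labels)  # [True, True, False, False, True]
```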
***

In [3]:
# Replace Other with Dresses
df["Department Name"] = df["Department Name"].replace({'Other':'Dresses'})

# Regular Expression to remove "Petites" specification in "Class Name"
class_length_before = len(set(df["Class Name"]))
df["Class Name"]= df["Class Name"].str.replace(
    r"(petites)|(petite)", '', regex=True).str.strip().str.lower().str.capitalize()

# Fix Knits Problem:
df.loc[df["Class Name"] == "Cut and sew knits", "Class Name"]= "Knits"
class_length_after = len(set(df["Class Name"]))
                         
print("Class Name Category Count:\nBefore: {}\nAfter: {}\nDifference: {}".format(
    class_length_before, class_length_after,class_length_before-class_length_after))

Class Name Category Count:
Before: 33
After: 22
Difference: 11


**Fixing the Redundancy:** <br>
While the **Division Name** variable is supposed to distinguish *General* from *General Petite*, the **Department Name** variable interferes with that distinction, since it has its own *Petites* category. Here, this redundancy is fixed.

In [4]:
# Dissolve Petites from Department Name, so it can be applied properly through Division Name
for x in set(df["Class Name"][df["Class Name"].notnull()]):
    df.loc[(df["Class Name"] == x) & (df["Department Name"] == "Petites"), "Department Name"] = \
    df.loc[(df["Class Name"]== x) & (df["Department Name"] != "Petites"), "Department Name"].mode()[0]
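
The reassignment logic in the cell above can be illustrated without pandas. This is a sketch on invented rows, not the notebook's implementation: each "Petites" row takes the most common non-"Petites" department found among rows of the same class. Unlike `.mode()[0]`, which would raise an error on an empty result, this version simply leaves a row unchanged when its class has no other department to borrow from.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (class name, department name) pairs.
rows = [
    {"class": "Dresses", "department": "Dresses"},
    {"class": "Dresses", "department": "Dresses"},
    {"class": "Dresses", "department": "Petites"},
    {"class": "Pants", "department": "Bottoms"},
    {"class": "Pants", "department": "Petites"},
]

# Count the non-"Petites" departments seen for each class.
counts = defaultdict(Counter)
for row in rows:
    if row["department"] != "Petites":
        counts[row["class"]][row["department"]] += 1

# Reassign each "Petites" row to the most common department of its class.
for row in rows:
    if row["department"] == "Petites" and counts[row["class"]]:
        row["department"] = counts[row["class"]].most_common(1)[0][0]

departments = [row["department"] for row in rows]
print(departments)  # ['Dresses', 'Dresses', 'Dresses', 'Bottoms', 'Bottoms']
```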

In [5]:
# Extracting Missing Count and Unique Count by Column
unique_count = []
for x in df.columns:
    unique_count.append([x,len(df[x].unique()),df[x].isnull().sum()])
unique_count

[['Clothing ID', 1206, 0],
 ['Age', 77, 0],
 ['IDD', 39, 0],
 ['Title', 13994, 3810],
 ['Review Text', 22635, 845],
 ['Rating', 5, 0],
 ['Recommended IND', 2, 0],
 ['Feedback Count', 82, 0],
 ['Positive Feedback Count', 82, 0],
 ['Division ID', 4, 14],
 ['Division Name', 4, 14],
 ['Department ID', 8, 14],
 ['Department Name', 7, 14],
 ['Class ID', 33, 14],
 ['Class Name', 21, 14]]

Many variables appear to be unnecessary and may compromise the human subjects. Therefore, they are removed.

Furthermore, even if "IDD" were identification keys, its 39 unique values across 23,486 rows would suggest that each individual published roughly 600 reviews on average.
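
As a quick arithmetic check, the average number of rows per "IDD" value follows from two figures printed in this notebook: 39 distinct "IDD" values in the unique-count output and 23,486 rows in the dataframe dimensions. The variable names are chosen here for illustration.

```python
total_rows = 23486  # rows reported in the dataframe dimensions
unique_idd = 39     # distinct "IDD" values in the unique-count output

rows_per_idd = total_rows / unique_idd
print(round(rows_per_idd))  # 602
```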

In [6]:
# Encoding to untrackable integer
df["Clothing ID"]= df["Clothing ID"].astype("category").cat.codes

# Dropping Variable Deemed Unworthy
df.drop(["Department ID", "Division ID", "Class ID","Feedback Count", "IDD"], axis = 1, inplace=True)

Since Clothing ID may be used to track the data source, I encode it as a categorical integer that begins at 0.
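
To make the encoding concrete: `astype("category").cat.codes` numbers the sorted distinct values starting from 0. The plain-Python sketch below mirrors that behavior on invented IDs; the helper `encode_as_codes` is hypothetical and does not handle missing values, which pandas would encode as -1.

```python
def encode_as_codes(values):
    """Map each value to its rank among the sorted distinct values."""
    categories = sorted(set(values))
    lookup = {value: code for code, value in enumerate(categories)}
    return [lookup[value] for value in values]

# Hypothetical original clothing IDs.
original_ids = [1078, 862, 1078, 1094, 862]
codes = encode_as_codes(original_ids)
print(codes)  # [1, 0, 1, 2, 0]
```

Repeated IDs keep the same code, so reviews of the same item can still be grouped after encoding.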

In [7]:
# And Save
df.to_csv("Women's Clothing E-Commerce Reviews.csv")

In [8]:
# Data Dimensions
print("Dataframe Dimension: {} Rows, {} Columns".format(*df.shape))
df.sample(7)

Dataframe Dimension: 23486 Rows, 10 Columns


| | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|
| 5144 | 863 | 41 | Cute tee | Very flattering tee. nice details & quality.\r... | 5 | 1 | 1 | General | Tops | Knits |
| 14388 | 1060 | 50 | Perfect white pants | I love these pants purchased them in white the... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 5131 | 873 | 63 | Great white top. | Finally, a white top that does not show every ... | 5 | 1 | 2 | General Petite | Tops | Knits |
| 15136 | 890 | 35 | Really really ridiculously good looking | I saw this sweater and just about died. i love... | 5 | 1 | 2 | General Petite | Tops | Fine gauge |
| 6389 | 1036 | 33 | Great fit, just one small prob | I like the jeans a lot. they're definitely ski... | 4 | 1 | 0 | General | Bottoms | Jeans |
| 15325 | 1104 | 41 | Absolutely love this | This is my fourth amadi piece from retailer. ... | 5 | 1 | 8 | General | Dresses | Dresses |
| 676 | 1059 | 60 | Great material, awkward length | The material and construction of the pants are... | 3 | 0 | 0 | General | Bottoms | Pants |


***
## Introduction to Dataset

Welcome. This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. Its eight supportive features offer a great environment to parse out the text through its multiple dimensions. Because this is real commercial data, it has been anonymized, and references to the company in the review title and body have been replaced with “retailer”.

## Variable Explanation

This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:
- **Clothing ID:** Integer Categorical variable that refers to the specific piece being reviewed.
- **Age:** Positive Integer variable of the reviewer's age.
- **Title:** String variable for the title of the review.
- **Review Text:** String variable for the review body.
- **Rating:** Positive Ordinal Integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).
- **Recommended IND:** Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
- **Positive Feedback Count:** Positive Integer documenting the number of other customers who found this review positive.
- **Division Name:** Categorical name of the product's high-level division.
- **Department Name:** Categorical name of the product's department.
- **Class Name:** Categorical name of the product's class.

I look forward to some quality NLP!
