## Preparing the Retail Dataset for Kaggle Usage

In [1]:
import pandas as pd

In [2]:
data = pd.ExcelFile("Women's Retail/retail dataset.xlsx")
df = data.parse("Sheet1", header=1)
# Drop the blank trailing columns "Unnamed: 18" through "Unnamed: 26"
df = df.drop([f"Unnamed: {i}" for i in range(18, 27)], axis = 1)
df = df.drop(["REVIEW_ID", "RATING_RANGE", "NUM_NEGATIVE_FEEDBACKS"], axis = 1)

# Anonymize
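# Assign the result back instead of calling replace(inplace=True) on a column selection
df["DIVISION_NAME"] = df["DIVISION_NAME"].replace({'ANTHRO  INTIMATES (NA)': "Initmates",
                             "ANTHRO. WOMEN'S DIVISION (NA)":"Women",
                             "ANTHRO. MISC. DIVISION (NA)":"Misc"})
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
df["DEPARTMENT_NAME"]= df["DEPARTMENT_NAME"].str.replace(
    r"(WOMEN'S)|(NA)|[()]|(ANTHRO.)", '', regex=True).str.strip().str.lower().str.capitalize()
df["CLASS_NAME"]= df["CLASS_NAME"].str.replace(
    r"[-]|(NA)|[()]|(ANTHRO.)", '', regex=True).str.strip().str.lower().str.capitalize()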

df.columns = ["Sterling External Alias", "Age", "IDD","Title","Review Text","Rating",
              "Recommended IND","Feedback Count","Positive Feedback Count","Division ID","Division Name",
              "Department ID","Department Name","Class ID","Class Name"]

### **Code Explanation:** <br>
The Path section changes the working directory of the Python program to where the data is located. The Read Data section uses pandas to read and parse the first sheet of the Excel-formatted retail dataset. Then, blank columns are removed, as well as the unused variables “REVIEW_ID”, “RATING_RANGE”, and “NUM_NEGATIVE_FEEDBACKS”. Finally, the Transform Dependent Variable to Binary section creates a new variable.

Next, the comments themselves must be cleaned and anonymized.

**Regular Expressions:**
- replace() : This command swaps every occurrence of each dictionary key with its corresponding dictionary value.
- str.strip() : This command removes whitespace at the start and end of the text.
- str.lower() : This command lowercases all the text.
- str.capitalize() : This command capitalizes the first character of each string and lowercases the rest.
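
As a minimal sketch of how these four calls behave, the snippet below applies them to a small Series. The sample values are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# Hypothetical raw labels, made up for illustration.
raw = pd.Series(["  TOPS (NA) ", "ANTHRO. MISC. DIVISION (NA)", "  dresses"])

# replace() with a dict swaps whole cell values that equal a key.
mapped = raw.replace({"ANTHRO. MISC. DIVISION (NA)": "Misc"})

# strip() trims surrounding whitespace, lower() lowercases everything,
# and capitalize() upper-cases only the first character of each string.
cleaned = mapped.str.strip().str.lower().str.capitalize()

print(cleaned.tolist())
```

Note that the dictionary form of replace() only matches entire cell values, which is why the first entry keeps its "(na)" suffix; removing fragments inside a string is the job of str.replace() with a pattern.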

**Regex Compilers:**
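- r" " : The "r" marks a raw string, so backslashes (\) are kept literally instead of being interpreted as Python escape sequences.
- r"(WOMEN'S)|(NA)|[()]|(ANTHRO.)" : Targets substrings that match any of the listed alternatives. Parentheses (...) group a literal sequence, [()] is a character class that matches a literal "(" or ")", and the unescaped "." in ANTHRO. matches any single character. The | is an OR operator, enabling the chaining of multiple string targets. Finally, every match is replaced with "" ; nothing.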
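
The pattern can be tried outside pandas with the standard `re` module. The sketch below runs the department pattern on one made-up label, and also shows what the unescaped dot does; the escaped variant `ANTHRO\.` is a suggestion for comparison, not what the code above uses.

```python
import re

# A raw string keeps the backslash as a literal character.
assert len(r"\n") == 2 and len("\n") == 1

# Same pattern as the department cleanup above.
pattern = r"(WOMEN'S)|(NA)|[()]|(ANTHRO.)"

# Hypothetical label, made up for illustration.
raw = "ANTHRO. WOMEN'S TOPS (NA)"
cleaned = re.sub(pattern, "", raw).strip().lower().capitalize()
print(cleaned)

# The unescaped dot matches any character, so it also eats the "P" here.
loose = re.sub(r"ANTHRO.", "", "ANTHROPOLOGY")
# Escaping the dot restricts the match to a literal period.
strict = re.sub(r"ANTHRO\.", "", "ANTHROPOLOGY")
print(loose, strict)
```

Each `|` separates alternatives, so a match on any one of them is removed; the remaining whitespace is then cleaned up by strip().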

The last part of the code renames the columns into a cleaner format. In fact, I could have used a list comprehension and regular expressions for this part as well, but it was simpler to just hard-code it!
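
For reference, the list comprehension alternative might look roughly like the sketch below. It only uses the three raw column names that appear in the code above, since the full original header is not shown, and the cleanup rule (underscores to spaces, then title case) is an assumption.

```python
import re

# Raw names that appear in the code above; the full header list is not shown.
raw_columns = ["DIVISION_NAME", "DEPARTMENT_NAME", "CLASS_NAME"]

# Swap underscores for spaces, then title-case each word.
clean_columns = [re.sub(r"_+", " ", name).title() for name in raw_columns]
print(clean_columns)

# With the DataFrame in hand, this would be assigned as: df.columns = clean_columns
```

A rule like this would not reproduce hand-picked names such as "Recommended IND", which is one reason to hard-code the list.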

***

### **Code Interpretation: Variables and Processing:** <br>
The first step is to address the variables and dive into the pre-processing steps necessary to turn raw text into valuable output. This dataset’s notable variables include: the review title and review body for a clothing product, the rating assigned to the product, the age of the customer, whether the product was recommended, and finally the department and division.

To facilitate sentiment analysis, a new boolean variable is created to categorize good and bad reviews. All reviews with a rating of 3 and over were deemed good, and reviews under 3 were deemed bad. This step is especially important for the Naive Bayes supervised learning algorithm, since it requires a clear binary label to train on.
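
The labeling rule can be sketched as below. The ratings are made up for illustration, and the column name "Good Review" is hypothetical, since the cells shown here do not include the actual labeling code.

```python
import pandas as pd

# Made-up ratings for illustration; "Rating" matches the renamed column above.
df_example = pd.DataFrame({"Rating": [1, 2, 3, 4, 5]})

# True for ratings of 3 and over, False for ratings under 3.
# "Good Review" is a hypothetical column name.
df_example["Good Review"] = df_example["Rating"] >= 3

print(df_example)
```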

In [3]:
df = df.drop(["Department ID", "Division ID", "Class ID","Feedback Count"], axis = 1)
df.drop(["Sterling External Alias","IDD"], axis=1, inplace=True)
#df.set_index(["Sterling External Alias","IDD"], inplace=True)
df.to_csv("Women's Clothing E-Commerce Reviews.csv")

In [4]:
df.sample(7)

| | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|
| 7471 | 50 | | this is a beautiful sweater that i wanted to l... | 3 | 0 | 7 | Misc | Petites | Fine gauge petites |
| 3909 | 53 | | | 1 | 0 | 0 | Women | Trend | Trend |
| 20726 | 42 | Summer staple | I bought this in white at my local store. Grea... | 5 | 1 | 0 | Women | Tops | Cut and sew knits |
| 16355 | 30 | gorgeous versatile skirt | I went to the Anthro store to look for an outf... | 5 | 1 | 0 | Misc | Petites | Skirts petites |
| 15352 | 66 | well designed | Love Cloth and Stone. This dress looks and fit... | 5 | 1 | 13 | Misc | Petites | Dresses petites |
| 6060 | 40 | | This is my favorite purchase so far! I am in l... | 5 | 1 | 0 | Women | Bottoms | Pants |
| 4514 | 31 | Classic and Comfortable | Love this dress so much. It's super comfortabl... | 5 | 1 | 0 | Women | Other | Dresses |


## Introduction to Dataset

## Variable Explanation