# Women's Clothing E-Commerce with Natural Language Processing 
_by Nick "Upping his game for 2018" Brooks, January 2018_

- [**Github**](https://github.com/nicapotato)
- [**Kaggle**](https://www.kaggle.com/nicapotato/)
- [**Linkedin**](https://www.linkedin.com/in/nickbrooks7)

***

**Programming Language:** Python 3.5 in the Jupyter Notebook Environment

**Textbook Resources Used:** <br>
- Swamynathan, Manohar. Mastering Machine Learning with Python in Six Steps: a Practical Implementation Guide to Predictive Data Analytics Using Python. Apress, 2017.
- Bird, Steven. Natural Language Processing with Python. O'Reilly Media, 2016.

**Code Navigation:** <br>
In the code, text after a hashtag (#) is an explanatory comment and is not executed.
An indented line is part of a larger function or loop and does not stand alone.
Functions are used throughout to keep the exploratory code simple and reusable.
Code: Packages Used

# Table of Contents:

**1. [Introduction](#Introduction)** <br>
**2. [Univariate Distribution](#Univariate)** <br>
**3. [Multivariate Distribution](#Multivariate)** <br>
	- 3.1 Categorical Variable by Categorical Variable
	- 3.2 Continuous Variable by Categorical Variable
	- 3.3 Continuous Variables on Continuous Variables
	- 3.4 Percentage Standardized Distribution Plots
    
**4. [Multivariate Analysis](#Multianalysis)** <br>
	- 4.1 In-Depth Simple Linear Regression Analysis for Age mean and Recommended Likelihood    
	- 4.2 Residual Visualization [Like found in R Studio]
**5. [Working with Text](#Text)** <br>
	- 5.1 Text Pre-Processing
	- 5.2 Sentiment Analysis
**5. [Sentiment Analysis](#Sentiment Analysis)** <br>
**6. [Word Distribution and Word Cloud](#Word Distribution and Word Cloud)** <br>
**7. [N Grams by Recommended Feature](#NGRAM)** <br>
**8. [Supervised Learning](#Supervised Learning)** <br>
	- 8.1 Naive Bayes
**9. [Word2Vec](#Word2Vec)** <br>


# **1. Introduction:** <a id="Introduction"></a> <br>
This notebook uses Python and Natural Language Processing (NLP) to explore trends in customer reviews from an anonymized women's clothing e-commerce platform and to extract actionable ways to improve its online store. The data contains 22,641 rows and 10 columns. Each row includes a written review along with additional customer information. The analysis focuses on using NLP techniques to find broad trends in the customers' written thoughts. The dataset contains 9,810 unique words.

My goal is to understand what customers appreciate and dislike about their purchases. To reach it, I conduct an observational study of this sizable dataset, first examining the characteristics of individual features, then increasing the complexity of the analysis once a suitable target is identified.


# Summarized Findings:


In [1]:
# General
import numpy as np
import pandas as pd
import nltk
import random
import os
from os import path
from PIL import Image

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from subprocess import check_output
from wordcloud import WordCloud, STOPWORDS

# Set Plot Theme
sns.set_palette([
    "#30a2da",
    "#fc4f30",
    "#e5ae38",
    "#6d904f",
    "#8b8b8b",
])
# Alternate # plt.style.use('fivethirtyeight')

# Pre-Processing
import string
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re
from nltk.stem import PorterStemmer

# Modeling
import statsmodels.api as sm
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
from nltk.util import ngrams
from collections import Counter
from gensim.models import word2vec

# Warnings
# import warnings
# warnings.filterwarnings('ignore')



**Code Explanation and Reasoning:** <br>
These packages are separated into four categories: *General, Visualization, Pre-Processing, and Modeling*.

The General category covers the basic tools: scientific computation (`numpy`), dataframes (`pandas`), Natural Language Processing (`nltk`), file path handling (`os`), and image handling (`PIL`).

The Visualization section enables the creation of simple graphics (`matplotlib`, `seaborn`), as well as `wordcloud`'s text frequency visualization.

The Pre-Processing section imports more specialized modules from the NLTK package, such as tokenizers and stemmers, which prepare text data for mathematical analysis.
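As a minimal sketch of what these modules do, the following chains a regex tokenizer, stop-word removal, and Porter stemming on one made-up review. The sample sentence, the `\w+` tokenizer pattern, and the tiny hand-written stop-word set are all illustrative assumptions; the notebook itself imports NLTK's `stopwords` corpus, which needs a separate download.

```python
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer

# Assumption: split on runs of word characters, which drops punctuation
tokenizer = RegexpTokenizer(r"\w+")
ps = PorterStemmer()

# Assumption: a tiny hand-written stop-word set, used here only to avoid
# downloading NLTK's stopwords corpus
stop_words = {"i", "these", "so"}

review = "I loved these dresses, so versatile!"  # made-up example review

tokens = tokenizer.tokenize(review.lower())          # lower-case, then tokenize
tokens = [w for w in tokens if w not in stop_words]  # remove stop words
stems = [ps.stem(w) for w in tokens]                 # reduce each word to its stem

print(stems)
```

Stems are not always dictionary words (for example `versatil`), which is why the Word2Vec queries later in this notebook use stemmed forms.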

The Modeling section includes `nltk`'s sentiment analysis module, which scores the sentiment of text, NLTK's n-grams utility, and `gensim.models`' word2vec. It also includes `statsmodels.api`, which offers an array of linear models.
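To illustrate the n-gram utility mentioned above, here is a small sketch that counts bigrams with `nltk.util.ngrams` and `collections.Counter`, both of which are imported in the first cell. The token list is a made-up example, not taken from the dataset.

```python
from collections import Counter
from nltk.util import ngrams

# Made-up, already pre-processed tokens standing in for review text
tokens = ["fit", "true", "size", "love", "fit", "true", "size"]

# ngrams(tokens, 2) yields every pair of adjacent tokens as a tuple
bigram_counts = Counter(ngrams(tokens, 2))

# Show each bigram with the number of times it occurs
for bigram, count in bigram_counts.most_common():
    print(" ".join(bigram), count)
```

Changing the second argument of `ngrams` to 3 would count trigrams in the same way.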

In [220]:
# Read and Peek at Data
df = pd.read_csv("Data/Women's Clothing E-Commerce Reviews.csv")
df.drop(df.columns[0], axis=1, inplace=True) # Drop the first column

## 1. Preparation

**Converting Text to a Model-able format: One Hot Encoding**

In [None]:
# NOTE: `tokenizer`, `stop_words`, `ps`, and `preprocessing` are not defined in this cell;
# they are assumed to come from the earlier text pre-processing cells.
df['tokenized'] = df["Review Text"].astype(str).str.lower() # Turn into lower case text
df['tokenized'] = df['tokenized'].apply(tokenizer.tokenize) # Apply tokenize to each row
df['tokenized'] = df['tokenized'].apply(lambda x: [w for w in x if w not in stop_words]) # Remove stopwords from each row
df['tokenized'] = df['tokenized'].apply(lambda x: [ps.stem(w) for w in x]) # Apply stemming to each row
all_words = nltk.FreqDist(preprocessing(df['Review Text'])) # Calculate word occurrence from whole block of text

vocab_count = 300
# In NLTK 3, FreqDist.keys() is not sorted by frequency, so use most_common() to keep the top words
word_features = [word for word, _ in all_words.most_common(vocab_count)] # 300 most frequent unique words
print("Number of word columns (One Hot Encoding): {}".format(len(word_features)))
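The cell above only selects the vocabulary. As a hedged sketch of the encoding step itself, the following turns tokenized reviews into 0/1 indicator rows, one column per vocabulary word. It uses only the standard library; the three sample reviews, the `one_hot` helper, and the small `vocab_count` are hypothetical stand-ins for `df['tokenized']` and the notebook's setting of 300.

```python
from collections import Counter

# Hypothetical, already-tokenized reviews standing in for df['tokenized']
reviews = [
    ["love", "dress", "fit"],
    ["love", "dress", "small"],
    ["love", "color"],
]

# Count every token across all reviews, then keep the most frequent ones
word_counts = Counter(token for review in reviews for token in review)
vocab_count = 2  # illustrative only; the notebook uses 300
word_features = [word for word, _ in word_counts.most_common(vocab_count)]

def one_hot(tokens, vocabulary):
    """Return one 0/1 indicator per vocabulary word (hypothetical helper)."""
    present = set(tokens)
    return [int(word in present) for word in vocabulary]

encoded = [one_hot(review, word_features) for review in reviews]

print(word_features)
for row in encoded:
    print(row)
```

Each row records only whether a word is present, not how often it occurs, which is the format a presence-based Naive Bayes classifier would expect.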

## Word2Vec

In [None]:
import gensim
from gensim.models import word2vec
import os
os.chdir(r"D:\My Computer\DATA\Retail")
os.listdir()

In [None]:
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

In [None]:
df.head()

In [None]:
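# gensim < 4.0 names the embedding-dimension parameter `size`; gensim >= 4.0 renamed it to `vector_size`
w2vec = word2vec.Word2Vec(df["tokenized"], min_count=5, size=200)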

In [None]:
# Query through `.wv`: calling most_similar on the model itself is deprecated and no longer available in gensim 4
w2vec.wv.most_similar(["versatil"], topn=10)

In [None]:
w2vec.wv.most_similar(["potato"], topn=10)

In [None]:
w2vec.wv.most_similar(["worst"], topn=10)

In [None]:
w2vec.wv.most_similar(["rag"], topn=10)

In [None]:
w2vec.wv.most_similar(["compliment"], topn=10)

In [None]:
w2vec.wv.most_similar(["love"], topn=10)

In [None]:
w2vec.wv.most_similar(["shame"], topn=10)

In [None]:
w2vec.wv.most_similar(["dear"], topn=10)
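The queries above rank words by the cosine similarity of their vectors. To make that concrete without training a model, here is a standard-library sketch of the same ranking. The three-dimensional `toy_vectors` are invented for illustration (real vectors here would have 200 dimensions), and this `most_similar` function is a simplified stand-in, not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word` (simplified stand-in)."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors keyed by stemmed words; not learned from the reviews
toy_vectors = {
    "love": [0.9, 0.1, 0.0],
    "ador": [0.8, 0.2, 0.1],
    "worst": [-0.7, 0.6, 0.1],
    "shame": [-0.6, 0.5, 0.3],
}

ranked = most_similar("love", toy_vectors, topn=2)
for word, score in ranked:
    print(word, round(score, 3))
```

Words pointing in nearly the same direction as the query score close to 1, while words pointing the opposite way score below 0.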