#### 1. Business Understanding

JCPenney is a North American department store chain founded in 1902 in Kemmerer, Wyoming, by James Cash Penney. Its headquarters are in Plano, Texas, where it is registered as J.C. Penney Corporation, Inc. to trade in goods and services. On the goods side, JCPenney sells clothing and footwear for men, women (including plus sizes) and children, as well as home products such as bedding, bath items, toiletries, kitchenware and window accessories. Its range also includes handbags, jewellery and beauty products, and it retails branded products such as Nike, Levi's and Stafford men's tailored clothing. Its services include styling salons, optical centers and custom decorating. To reach its large customer base, the organisation operates approximately 647 physical stores in America and Puerto Rico and maintains a substantial online presence through its e-commerce platform (jcpenney.com, not currently available in the UK), through which it markets goods and services. It reportedly receives approximately 26 million online views on average, rising to 65 million per month during the peak holiday season. Given this large online presence and its face-to-face contact with customers, especially during shopping and services such as salon styling, the organisation needs a business strategy built on customer satisfaction: excellent service and the use of customer feedback to reduce intensely negative sentiment and churn risk. In addition, to maintain healthy business-to-customer interaction, the database must be free of corrupted data so that accurate customer data supports personalised communication and reliable, secure decision making.
Following from the above business understanding, the <b>objectives</b> of this research are defined as follows:

1. To identify how the intensity of customer sentiment is spread across the United States and its territories
2. To identify weaknesses in the collection and storage of data, with a view to mitigating ineffective customer communication and, if necessary, reducing cyber security risk.

**Data to Use:**
1. All columns of reviews.csv
2. Price and Av_Score columns from products.csv
3. All columns of jcpenney_reviewers.json
4. users.csv is excluded because of its similarity to jcpenney_reviewers.json

#### Abstract : How I answered the above questions to create the project

The strategy is to move from data to actionable business solutions, so that the data work stays aligned with the core business objectives:
1. Identified customers' birthdays: there was no increase in the number of birthdays celebrated over the period under review
2. Identified duplicated usernames
3. Identified duplicated empty data structures, including strings and lists, in the databases
4. Used the VADER model to generate polarity scores
5. Used the k-means algorithm to fine-tune the model because the dataset is unlabelled
6. Used Hugging Face to generate negative and/or positive sentiments
7. Considered linear regression as a means of establishing a relationship between the average customer score and the price of the product. This was abandoned because there was no dependable correlation between the columns
8. Compared the algorithms used, created a result dataframe from the model, and used the outcome to visualise customer sentiment across the United States
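
Step 7 can be illustrated with a quick correlation check of the kind that might precede a regression. This is a minimal sketch only: the sample values and the 0.3 cut-off are assumptions made up for illustration, and only the column names `Price` and `Av_Score` come from the project data.

```python
import pandas as pd

# Hypothetical sample; the real values would come from products.csv.
products_sample = pd.DataFrame({
    "Price": [12.0, 35.5, 20.0, 80.0, 15.0, 60.0],
    "Av_Score": [4.5, 2.0, 3.5, 3.0, 1.5, 4.0],
})

# Pearson correlation coefficient between the two columns (range -1 to 1).
pearson_r = products_sample["Price"].corr(products_sample["Av_Score"])

# The 0.3 cut-off is an arbitrary illustration, not a value from the project.
worth_modelling = abs(pearson_r) >= 0.3
print(f"Pearson r = {pearson_r:.3f}; fit a regression? {worth_modelling}")
```

A coefficient close to zero suggests that a straight-line fit would explain little of the variation in the scores, which is the kind of evidence that would justify dropping the regression.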

#### b. Importing libraries and loading data

In [8]:
##### import libraries for numerical calculations, data and datetime processing
import numpy as np
import pandas as pd
import datetime as dt

# imported for visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import folium

#from folium.plugins import FastMarkerCluster
from folium.plugins import MarkerCluster
from folium.plugins import HeatMap
import branca.element as be
import branca.colormap as cm

# for processing text data, build model that generates vader score for intensity of customer ratings
import re 
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer # vader score processing
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from tqdm.notebook import tqdm

# kmeans is used to cross validate vader output and fine tune the vader score
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

# to build a Hugging Face model and compare it with the other libraries
from transformers import pipeline

import warnings
warnings.filterwarnings("ignore")

In [9]:
#### Load Dataframe
csv_reviews_data = pd.read_csv('data/reviews.csv')
csv_products_data = pd.read_csv('data/products.csv')
csv_user_data = pd.read_csv('data/users.csv')

# The 'lines=True' parameter is used here for json's file to mitigate ValueError: Trailing data
json_reviewers_data = pd.read_json('data/jcpenney_reviewers.json', lines = True)
json_products_data = pd.read_json('data/jcpenney_products.json', lines=True)

In [12]:
states_lat_long_df = pd.read_csv('data/states_data.csv')  # Added: latitude and longitude data from another source

#### 2. Data understanding and preparation - 
Explore the data and show you understand its structure and relations, with the aid of appropriate visualisation techniques. Assess the data quality, which insights you would be able to answer from it, and what preparation the data would require. Add new data from another source if required to bring new insights to the data you already have.

##### (i). jcpenney_reviewers.json and reviews.csv

In [123]:
#csv_reviews_data.head(2)

In [21]:
json_reviewers_data.isna().sum() # count the null values in each column

Username    0
DOB         0
State       0
Reviewed    0
dtype: int64

In [60]:
json_reviewers_data.dtypes

Username    object
DOB         object
State       object
Reviewed    object
dtype: object

In [77]:
json_reviewers_data.astype(str).duplicated(keep=False).sum()# count rows that are duplicated across all columns. This returned 0
                                                # the dataframe is cast to str to avoid TypeError: unhashable type: 'list',
                                                # because lists are mutable (unhashable) while strings are not

np.int64(0)
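
The cast to `str` can be illustrated on a toy dataframe. This is a minimal sketch, assuming made-up usernames and ids; on pandas versions that hash each column's values, the direct call is expected to fail on the list column, so it is wrapped in `try`/`except` rather than relied upon.

```python
import pandas as pd

# Hypothetical rows that mimic the Username / Reviewed layout.
toy_reviewers = pd.DataFrame({
    "Username": ["user_a", "user_a", "user_b"],
    "Reviewed": [["id1"], ["id1"], []],
})

try:
    toy_reviewers.duplicated(keep=False)
except TypeError as err:
    # Lists cannot be hashed, so the whole-row comparison may fail here.
    print(f"Direct call failed: {err}")

# Casting to str makes every cell hashable, so whole rows can be compared.
duplicate_mask = toy_reviewers.astype(str).duplicated(keep=False)
print(duplicate_mask.tolist())  # [True, True, False]
```

Only the first two rows are flagged, because a row counts as a duplicate only when every column matches.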

In [29]:
json_reviewers_data.head(3)# show the first three rows to check visually whether the dataset contains corrupted data

Unnamed: 0,Username,DOB,State,Reviewed
0,bkpn1412,31.07.1983,Oregon,[cea76118f6a9110a893de2b7654319c0]
1,gqjs4414,27.07.1998,Massachusetts,[fa04fe6c0dd5189f54fe600838da43d3]
2,eehe1434,08.08.1950,Idaho,[]


In [122]:
# Alternative code to show the first three rows of the dataframe
json_reviewers_data.iloc[0:3,:]

Unnamed: 0,Username,DOB,State,Reviewed
0,bkpn1412,31.07.1983,Oregon,[cea76118f6a9110a893de2b7654319c0]
1,gqjs4414,27.07.1998,Massachusetts,[fa04fe6c0dd5189f54fe600838da43d3]
2,eehe1434,08.08.1950,Idaho,[]


**(a). Processing Reviewed column**

In [66]:
# complexity of the Reviewed column
print(f"Reviewed column is a:{type(json_reviewers_data['Reviewed'])}\nthe data type is:{json_reviewers_data['Reviewed'].dtypes}\nand each row is a:{type(json_reviewers_data['Reviewed'][0])}")

Reviewed column is a:<class 'pandas.core.series.Series'>
the data type is:object
and each row is a:<class 'list'>


In [73]:
jrev_rows_with_empty_list=json_reviewers_data[json_reviewers_data['Reviewed'].apply(len) == 0] # select rows whose list has a length of 0
print(f'Numbers of rows with empty list: {jrev_rows_with_empty_list.shape[0]}')

Numbers of rows with empty list: 971


In [119]:
print(f"Alternative code to find numbers of duplicated rows: {json_reviewers_data['Reviewed'].duplicated(keep=False).sum()}")

Alternative code to find numbers of duplicated rows: 971


In [74]:
percentage_rows_with_empty_list=(jrev_rows_with_empty_list.shape[0]/json_reviewers_data.shape[0])*100
print(f'Percentage of rows with empty list: {percentage_rows_with_empty_list}%')

Percentage of rows with empty list: 19.42%
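
The difference between a null value and an empty list can be shown on a small example. This is a minimal sketch with hypothetical rows; only the `Reviewed` layout (a list of ids per user) follows the dataset.

```python
import pandas as pd

# Hypothetical rows; only the Reviewed layout follows the dataset.
toy_reviewers = pd.DataFrame({
    "Username": ["user_a", "user_b", "user_c"],
    "Reviewed": [["id1", "id2"], [], []],
})

# isna() only flags None/NaN, so empty lists are not counted as missing.
null_count = toy_reviewers["Reviewed"].isna().sum()
print(null_count)  # 0

# Measuring the list length exposes the empty entries instead.
is_empty = toy_reviewers["Reviewed"].apply(len) == 0
empty_count = is_empty.sum()
print(empty_count)  # 2

# The mean of a boolean mask is the share of True values.
empty_share = is_empty.mean() * 100
print(f"{empty_share:.2f}% of rows hold an empty list")
```

Taking the mean of the boolean mask is an alternative to dividing the two `shape[0]` values and gives the same percentage.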


**(b). Username column**

In [79]:
print(f' Numbers of Username duplicated rows: {json_reviewers_data.Username.duplicated(keep=False).sum()}')# This contrasts with the result obtained when
                                                                                                        # duplicated() was called on the complete dataframe

 Numbers of Username duplicated rows: 2


In [88]:
duplicated_username = json_reviewers_data[json_reviewers_data['Username'].duplicated(keep=False)]
duplicated_username

Unnamed: 0,Username,DOB,State,Reviewed
731,dqft3311,28.07.1995,Tennessee,[5f280fb338485cfc30678998a42f0a55]
2619,dqft3311,03.08.1969,New Mexico,[571b86d307f94e9e8d7919b551c6bb52]


In [94]:
percentage_duplicated_username = (duplicated_username.shape[0]/json_reviewers_data.shape[0])*100
percentage_duplicated_username

0.04

**(c). reviews.csv**

In [97]:
csv_reviews_data.isna().sum()

Uniq_id     0
Username    0
Score       0
Review      0
dtype: int64

In [99]:
csv_reviews_data.astype(str).duplicated(keep=False).sum()

np.int64(0)

In [100]:
csv_reviews_data.columns

Index(['Uniq_id', 'Username', 'Score', 'Review'], dtype='object')

In [102]:
csv_reviews_data.shape[0]

39063

In [108]:
# look up the duplicated username from json_reviewers_data in csv_reviews_data to extract the reviews written under it
csv_duplicated_in_username = csv_reviews_data[csv_reviews_data['Username']=='dqft3311']
print(f'Numbers of user reviews by duplicated username: {csv_duplicated_in_username.shape[0]}')

Numbers of user reviews by duplicated username: 17


In [112]:
percent_data_generated_by_dupl_username=(csv_duplicated_in_username.shape[0]/csv_reviews_data.shape[0])*100
print(f'percent of data generated by duplicated username: {round(percent_data_generated_by_dupl_username,3)}%')

percent of data generated by duplicated username: 0.044%


In [114]:
json_reviewers_data['Reviewed'].duplicated(keep=False).sum()

np.int64(971)

In [124]:
json_reviewers_data

Unnamed: 0,Username,DOB,State,Reviewed
0,bkpn1412,31.07.1983,Oregon,[cea76118f6a9110a893de2b7654319c0]
1,gqjs4414,27.07.1998,Massachusetts,[fa04fe6c0dd5189f54fe600838da43d3]
2,eehe1434,08.08.1950,Idaho,[]
3,hkxj1334,03.08.1969,Florida,"[f129b1803f447c2b1ce43508fb822810, 3b0c9bc0be6..."
4,jjbd1412,26.07.2001,Georgia,[]
...,...,...,...,...
4995,mfnn1212,27.07.1997,Delaware,[d6cd506246bd17afa611b6a06236713c]
4996,ejnb3414,01.08.1976,Minnesota,[97de1506cd0bcbe50f2797cd0588eb81]
4997,pdzw1433,28.07.1994,Ohio,"[799d62906019d910fa744987da184ae7, b8f5deb7b02..."
4998,npha1342,07.08.1953,Montana,[6250b1d691cd3842f05b87736f2fadbf]


**(d). observation:**
As shown above, calling duplicated(keep=False) on the whole json_reviewers_data dataframe reported no duplicates, but calling it on the Reviewed column alone returned 971 duplicated rows. Closer investigation showed that Reviewed is a pandas Series whose rows are lists of strings. An empty list is a valid Python object rather than a missing value (None or NaN), so isna().sum() does not count it as null, hence the zero reported earlier. The whole-dataframe check found nothing, despite the obvious repeated empty lists in Reviewed, because it compares complete rows: the values in the other columns (Username, DOB, State) make each row unique, so no row is flagged as a duplicate. The 971 duplicates in the column correspond to the 971 rows that hold an empty list.
By design, json_reviewers_data allows each Username to make multiple reviews, so the Reviewed column holds a list of the unique ids of all items reviewed by that user. However, because each Username should represent one distinct individual, a duplicated Username in the database is questionable. Only one Username (dqft3311) is duplicated, with two entries at index 731 and 2619 that show different DOBs and different states (Tennessee and New Mexico). The knock-on effect is that looking up this Username in csv_reviews_data returns 17 reviews, and it is not possible to tell whether each one was written by the user from Tennessee or the one from New Mexico. Such duplication can lead to flawed analytics, weakened data integrity, operational inefficiency and faulty decision-making, and may be a precursor to cyber vulnerability.
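
The attribution problem can be made explicit if the reviews were ever joined to the reviewer profiles on Username. That join is an assumption here, as it is not shown in this section, and the rows below are hypothetical. The sketch uses the `validate` argument of `DataFrame.merge`, which rejects a many-to-one join when the key is not unique in the right-hand table.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical profiles: user_a appears twice with different states.
reviewers = pd.DataFrame({
    "Username": ["user_a", "user_a", "user_b"],
    "State": ["Tennessee", "New Mexico", "Oregon"],
})
reviews = pd.DataFrame({
    "Username": ["user_a", "user_b"],
    "Score": [4, 2],
})

try:
    # Each review should match exactly one reviewer profile.
    reviews.merge(reviewers, on="Username", validate="many_to_one")
    merge_is_safe = True
except MergeError as err:
    merge_is_safe = False
    print(f"Merge rejected: {err}")

# Usernames that map to more than one reviewer profile.
profiles_per_user = reviewers.groupby("Username").size()
ambiguous = profiles_per_user[profiles_per_user > 1].index.tolist()
print(ambiguous)  # ['user_a']
```

Without the `validate` check the merge would succeed silently and duplicate each review by `user_a` once per matching profile, which is one way such a duplicate could distort state-level results.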

**(e). Data to delete:**

(i). <b><i>The Reviewed column of the jcpenney_reviewers.json data.</i></b>

**Justification to delete:**

   * a. It contains 971 rows with an empty list, which is treated here as corrupted data
   * b. Deleting the column instead of the rows with an empty list preserves all entries in the Username column, which is the most relevant column for this project
   * c. The information contained in the Reviewed column is still available in reviews.csv in a different format.

(ii). <b><i>Contents of the Username dqft3311</i></b>

**Justification to delete:**
* The username is duplicated. Deleting it will help to preserve data integrity, improve the performance of the model, save storage and help to mitigate cyber security threats to the database as a whole

In [126]:
# Delete username == dqft3311 from the jcpenney_reviewers.json data and save the result as the reviewers dataframe
reviewers_df = json_reviewers_data.drop(json_reviewers_data[json_reviewers_data['Username'] == 'dqft3311'].index)

# delete Reviewed column from dataframe
reviewers_df.drop(columns=['Reviewed'], inplace=True)

# Delete username == dqft3311 from reviews.csv data and save dataframe as reviews dataframe
reviews_df=csv_reviews_data.drop(csv_reviews_data[csv_reviews_data['Username'] == 'dqft3311'].index)

####  3. **Data modeling (optional)** - Would modeling be required for the insights you have considered? Use appropriate techniques, if so.
**(i). Linear Regression model to show relationship between scores and price**