### Analysis of an E-commerce Dataset

We have been provided with a combined e-commerce dataset. In this dataset, each user has the ability to post a rating and review for the products they purchased. Additionally, other users can evaluate the initial rating and review by expressing their trust or distrust.

This dataset includes a wealth of information for each user. Details such as their profile, ID, gender, city of birth, product ratings (on a scale of 1-5), reviews, and the prices of the products they purchased are all included. Moreover, for each product rating, we have information about the product name, ID, price, and category, the rating score, the timestamp of the rating and review, and the average helpfulness of the rating given by others (on a scale of 1-5).

The dataset is from several data sources, and we have merged all the data into a single CSV file named 'A Combined E-commerce Dataset.csv'. The structure of this dataset is represented in the header shown below.

| userId | gender | rating | review| item | category | helpfulness | timestamp | item_id | item_price | user_city|

    | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |  ---- |  ---- |  
    
#### Description of Fields

* __userId__ - the user's id
* __gender__ - the user's gender
* __rating__ - the user's rating towards the item
* __review__ - the user's review towards the item
* __item__ - the item's name
* __category__ - the category of the item
* __helpfulness__ - the average helpfulness of this rating
* __timestamp__ - the timestamp when the rating is created
* __item_id__ - the item's id
* __item_price__ - the item's price
* __user_city__ - the city of user's birth

Note that, a user may rate multiple items and an item may receive ratings and reviews from multiple users. The "helpfulness" is an average value based on all the helpfulness values given by others.

There are four questions to explore with the data as shown below.



<img src="data-relation.png" align="left" width="400"/>
(You can find the data relation diagram on iLearn - Portfolio Part 1 resources - Fig1)


 #### Q1. Remove missing data

Please remove the following records in the csv file:

 * gender/rating/helpfulness is missing
 * review is 'none'

__Display the DataFrame, counting number of Null values in each column, and print the length of the data__ before and after removing the missing data.  

In [20]:
# your code and solutions
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
columns = ['userId', 'timestamp', 'review', 'item', 'rating', 'helpfulness', 'gender', 'category', 'item_id' , 'item_price','user_city']
rate = pd.read_csv('/Users/esther/Documents/GitHub/Portfolio-part-1/E-commerce_Dataset.csv', header=0, names=data)
rate = rate.dropna(subset=['gender', 'rating', 'helpfulness'], how='all')
rate = rate[rate['review'] != 'none']
rate.head()



Unnamed: 0,userId,timestamp,review,item,rating,helpfulness,gender,category,item_id,item_price,user_city
0,4051,12807,Great job for what it is!,eBay,5.0,2.0,F,Online Stores & Services,88,149.0,39
1,4052,122899,Free Access Worth your Time,NetZero,5.0,0.0,F,Online Stores & Services,46,53.0,39
2,33,12700,AOL..I love you!!!!!!!!!!!!,AOL (America Online),5.0,4.0,F,Online Stores & Services,0,145.84,31
3,33,21000,EBAY!!! I LOVE YOU!!!! :-)*,eBay,5.0,4.0,F,Online Stores & Services,88,149.0,31
4,33,22300,Blair Witch...Oh Come On.......,Blair Witch Project,1.0,4.0,F,Movies,12,44.0,31


#### Q2. Descriptive statistics

With the cleaned data in Q1, please provide the data summarization as below:

* Q2.1 total number of unique users, unique reviews, unique items, and unique categories
* Q2.2 descriptive statistics, e.g., the total number, mean, std, min and max regarding all rating records
* Q2.3 descriptive statistics, e.g., mean, std, max, and min of the number of items rated by different genders
* Q2.4 descriptive statistics, e.g., mean, std, max, min of the number of ratings that received by each items


In [16]:
#Q2.1 
unique_users = rate['userId'].nunique()
unique_reviews = rate['review'].nunique()
unique_items = rate['item'].nunique()
unique_categories = rate['category'].nunique()
print(unique_users)
print(unique_reviews)
print(unique_items)
print(unique_categories)
sum = unique_users+ unique_reviews+ unique_items+ unique_categories
print("Total number of unique categories:", str(sum ))

8577
19523
89
9
Total number of unique categories: 28198


In [17]:
#Q2.2
desc_stats = rate.describe()

print(desc_stats)

             userId      timestamp        rating   helpfulness       item_id  \
count  19982.000000   19982.000000  19965.000000  19960.000000  19982.000000   
mean    5499.854819   59030.355270      3.702229      2.595842     41.790962   
std     3343.355825   37973.395109      1.404524      1.750862     27.272644   
min        0.000000   10100.000000      1.000000      0.000000      0.000000   
25%     1994.000000   21500.000000      3.000000      0.000000     17.000000   
50%     5894.000000   52701.000000      4.000000      4.000000     41.000000   
75%     8407.750000   91600.000000      5.000000      4.000000     65.000000   
max    10808.000000  123199.000000      5.000000      4.000000     88.000000   

         item_price     user_city  
count  19982.000000  19982.000000  
mean      82.182478     19.396006  
std       42.239554     11.626129  
min       12.000000      0.000000  
25%       48.250000      9.000000  
50%       72.000000     19.000000  
75%      126.500000     29.

In [51]:
#Q2.3 
m_gender = rate['gender'] == 'M' 
f_gender = rate['gender'] == 'F' 
item = rate['item']

m_count = rate[m_gender & item]
f_count = rate[f_gender & item]

m_count_stats = m_count.describe()
f_count_stats = f_count.describe()
print(m_count_stats)
print(f_count_stats)


males=rate[rate.gender=="M"]
females=rate[rate.gender=="Female"]

item_count=rate[rate.income ==">50K"]
males_high=males[males.income==">50K"]
females_high= females[females.income==">50K"]

             userId      timestamp        rating   helpfulness       item_id  \
count  10142.000000   10142.000000  10135.000000  10130.000000  10142.000000   
mean    5466.158746   58827.812956      3.686828      2.605035     42.114770   
std     3362.602066   37819.533095      1.413306      1.749794     27.283062   
min        2.000000   10100.000000      1.000000      0.000000      0.000000   
25%     1934.000000   21600.000000      3.000000      0.000000     18.000000   
50%     5883.500000   52700.000000      4.000000      4.000000     41.000000   
75%     8423.000000   91400.000000      5.000000      4.000000     66.000000   
max    10808.000000  123199.000000      5.000000      4.000000     88.000000   

         item_price     user_city  
count  10142.000000  10142.000000  
mean      81.819895     19.440643  
std       42.336249     11.537541  
min       12.000000      0.000000  
25%       48.250000      9.000000  
50%       71.520000     19.000000  
75%      126.500000     29.

KeyError: 'str(m_gender)'

#### Q3. Plotting and Analysis

Please try to explore the correlation between gender/helpfulness/category and ratings; for instance, do female/male users tend to provide higher ratings than male/female users? Hint: you may use the boxplot function to plot figures for comparison (___Challenge___)
    
You may need to select the most suitable graphic forms for ease of presentation. Most importantly, for each figure or subfigure, please summarise ___what each plot shows___ (i.e. observations and explanations). Finally, you may need to provide an overall summary of the data.

In [None]:
# your code and solutions

#### Q4. Detect and remove outliers

We may define outlier users, reviews and items with three rules (if a record meets one of the rules, it is regarded as an outlier):

1. reviews of which the helpfulness is no more than 2
2. users who rate less than 7 items
3. items that receives less than 11 ratings

Please remove the corresponding records in the csv file that involves outlier users, reviews and items. You need to follow the order of rules to perform data cleaning operations. After that, __print the length of the data__.

In [None]:
# your code and solutions