# Introduction
## What Is Wish?
According to a Google search, **Wish** is an online e-commerce platform that connects millions of customers in over 60 countries to 250,000 merchants globally.

<img src="https://www.techzilla.it/wp-content/uploads/2019/03/imageproxy-1200x402.png" width="700">

On the site we can find different categories of products, such as:
* Technology (Laptops, Chargers, Hardware etc.)
* Car and House Accessories
* Clothes

## What Is This Dataset About?
In this analysis we are going to focus on the **Clothes** category. According to the dataset description on Kaggle, the data was collected in **August 2020** by searching for the word **"summer"**.





# 1. Data Cleaning

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', 50)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#loading and showing head of dataset
df = pd.read_csv("/kaggle/input/summer-products-and-sales-in-ecommerce-wish/summer-products-with-rating-and-performance_2020-08.csv")
df.head(4)

In [None]:
df.columns

There are **43** columns in total, so let's try to reduce that number.

In [None]:
df = df[['title_orig', 'price', 'retail_price',
        'units_sold', 'uses_ad_boosts', 'rating', 'rating_count',
       'rating_five_count', 'rating_four_count', 'rating_three_count',
       'rating_two_count', 'rating_one_count', 'badges_count',
       'badge_local_product', 'badge_product_quality', 'badge_fast_shipping',
       'tags', 'shipping_option_price','shipping_is_express', 'countries_shipped_to',
        'has_urgency_banner','merchant_rating_count', 'merchant_rating','merchant_has_profile_picture',
        ]]



**Some considerations:**
* Excluded product color and size, because each row only shows the variation that happened to be found when the data was collected. Every product comes in more colors and sizes, so a single value per product doesn't say much.
* Excluded the currency, as every price is in EUR.
* Excluded the shipping option name: the `shipping_is_express` column is enough.
* Excluded the product and merchant IDs and pictures, as they won't be needed.

*Now let's look at particular values that are missing and see if we have to modify columns*

In [None]:
df.info()

In [None]:
df.isnull().sum()

After loading the data we see that we have **1573** entries and some null values. Let's look at what some of them mean:
* Exactly 45 null values in each of the rating count columns, from 1 to 5 stars. Could it be that these are products with no ratings at all?
* 1100 null values in the `has_urgency_banner` column: the non-null values there are all 1, so the null values should be turned into 0.

In [None]:
#rename columns
df = df.rename(columns={'has_urgency_banner': 'is_running_out',
                       'title_orig': 'title'})
#fix the running out column
df['is_running_out'] = df['is_running_out'].fillna(0)

df[df['rating_five_count'].isna()][['rating_four_count', 'rating_three_count',
       'rating_two_count', 'rating_one_count']].isna().sum()

*As expected, the null values in the rating count columns all belong to the same 45 products.*

In [None]:
df[df['rating_five_count'].isna()].sample(10)

*They seem to have a rating of 5 even though they have no ratings, so we are going to set the rating and all the rating counts to 0.*

In [None]:
#changing rating to 0
df.loc[df['rating_five_count'].isna(), 'rating'] = 0

df.loc[df['rating_five_count'].isna(), ['rating_five_count',
                                        'rating_four_count', 'rating_three_count',
                                       'rating_two_count', 'rating_one_count']] = 0
df.columns

**We now have a clean dataset and are ready to explore it!**

# 2. Data Exploration

## 2.1 Successful Products 
> Let's start looking at the most successful products and show the top 10 of the most sold ones.

In [None]:
df.head()

In [None]:
top_10_products = df.sort_values(by='units_sold', ascending=False).head(10)


# Create a horizontal bar plot

sns.barplot(x='units_sold', y='title', data=top_10_products, hue = 'units_sold')
plt.xlabel('Units Sold')
plt.ylabel('Product Title')
plt.title('Top 10 Products by Units Sold')
plt.show()

*We can see that:*
* The best-selling products are **mini dresses** and **bikinis**
* The word **sexy** is repeated many times

> Now let's look at the number of units sold, since it's strange that the values are exactly 100000 or 50000.

In [None]:
df['units_sold'].unique()

It seems like every value is rounded down to the lower bound of a range. So let's assume the ranges are as follows:
* 10-50
* 50-100
* 100-1000
* 1000-5000
* 5000-10000
* 10000-20000
* 20000-50000
* 50000-100000
* 100000+
> Let's round the numbers below 10 up to 10 and define 3 performance classes:
1. Average: 10-1000
2. Successful: 1000-20000
3. Very Successful: 20000+
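
The three classes above can also be sketched with `pandas.cut`, which bins a numeric column using explicit edges. This is only an illustration on made-up values, not the dataset itself, and it assumes the boundaries are inclusive on the left (1000 counts as successful, 20000 as very successful).

```python
import numpy as np
import pandas as pd

# Made-up units_sold values (not taken from the dataset)
units_sold = pd.Series([10, 100, 1000, 5000, 20000, 100000])

# right=False closes each bin on the left:
# [0, 1000), [1000, 20000), [20000, inf)
performance = pd.cut(
    units_sold,
    bins=[0, 1000, 20000, np.inf],
    labels=["average", "successful", "very_successful"],
    right=False,
)

print(performance.tolist())
```

Keeping the edges in one list makes it easy to see which class a boundary value falls into.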

In [None]:
#round to 10
df.loc[df['units_sold'] < 10, 'units_sold'] = 10
units = np.sort(df['units_sold'].unique())

#show units sold by range

ranges = ['10-50','50-100','100-1000', '1000-5000', '5000-10000', '10000-20000', '20000-50000', '50000-100000', '100000+']
def find_units_count():
    units_array = []
    for x in units:
        times = df.loc[df['units_sold'] == x, 'units_sold'].count()
        units_array.append(times)
    return units_array
        
units_count = find_units_count()
units_count
        
sns.barplot(x=ranges, y=units_count, palette='mako')
plt.title('Count of Units Sold by Range')
plt.xlabel('Units Sold Range')
plt.ylabel('Count')

# Display the plot
plt.xticks(rotation=50, ha='right')
plt.show()

In [None]:
#average if < 1000, successful if  1000<=x<20000 ,very successful >=20000
df['performance'] = df['units_sold'].apply(lambda x: 'average' if x < 1000 
                                           else ('successful' if x < 20000 else 'very_successful'))
#pie plot
plt.style.use('bmh')
counts = df['performance'].value_counts()
plt.pie(counts, autopct='%1.1f%%', labels = counts.index)
plt.title('Distribution of Products Performance')
plt.show()

*Looking at the two plots we can see that:*
* The majority of products in the dataset are in the **100-1000** and **1000-5000** ranges. What surprises me the most is that there are more products with *5000-10000* units sold than with *50-100* units sold: it seems like these summer products are selling well!
* **50%** of the products are successful based on the thresholds I chose, but is there something different between average, successful and very successful products?

We will try to answer the last question in the next steps.

## 2.2 Impact of Ratings on Success

> Let's look at why products are successful, starting with rating.

In [None]:
df.columns

In [None]:
df[['title','rating', 'rating_count', 'rating_five_count', 'rating_four_count',
       'rating_three_count', 'rating_two_count', 'rating_one_count']]

> Looking at the rating columns I notice two things:
1. Some products have a rating count that's too low to be meaningful, so let's only look at products with more than 100 ratings to get a fair result.
2. The rating of each product is a decimal number, so let's create another column that assigns each product to its rating range.

In [None]:
df_ratings = df[df['rating_count'] > 100]
df_ratings = df_ratings[['title','price', 'retail_price', 'units_sold','rating', 'rating_count', 'rating_five_count', 'rating_four_count',
       'rating_three_count', 'rating_two_count', 'rating_one_count','performance']]
df_ratings['rating_count'].count()

> We have 876 products so that's a good amount to work with

In [None]:
df_ratings['rating_range'] = df_ratings['rating'].apply(lambda x: '< 1' if x < 1 
                                           else '1-2' if x < 2 
                                           else '2-3' if x < 3
                                           else '3-4' if x < 4
                                           else '4-5')
df_ratings['rating_range'].value_counts()

> We see something strange: there are only 13 products in the 2-3 range and nothing below it when we filter for more than 100 ratings. Let's display only the results for the 3-4 and 4-5 ranges.

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
df_grouped = df_ratings.groupby(['rating_range', 'performance']).size().reset_index(name='counts')
df_grouped = df_grouped[df_grouped['rating_range'] != '2-3']
# Plot the data with hue based on 'performance'
sns.barplot(x='rating_range', y='counts', hue='performance', data=df_grouped, palette='Set2')
plt.ylim(0, 500)
# Add labels and title
plt.xlabel('Rating Range')
plt.ylabel('Count of Products')
plt.title('Product Performance by Rating Range')

# Show the plot
plt.show()

**Even though the 2-3 range and below don't give us much information, we see that with over 100 reviews, products with a rating of 3 or more have a really good chance of being successful. A likely reason is that it reassures customers that they will probably like a product that other people already liked.**
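
Raw counts can hide how the classes are split inside each rating range, because the ranges contain different numbers of products. One way to compare them is a row-normalized cross-tabulation. The sketch below uses invented rows standing in for `df_ratings`, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Invented rows standing in for df_ratings (not real data)
toy = pd.DataFrame({
    "rating_range": ["3-4", "3-4", "3-4", "3-4", "4-5", "4-5", "4-5", "4-5"],
    "performance": ["average", "successful", "successful", "very_successful",
                    "average", "successful", "successful", "successful"],
})

# normalize="index" converts each row to shares that sum to 1,
# so ranges with different product counts become comparable
shares = pd.crosstab(toy["rating_range"], toy["performance"], normalize="index")

print(shares.round(2))
```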

## 2.3 Impact of Merchant Reputation

> Now let's look at data about the merchants

In [None]:
df.columns

In [None]:
df.head(3)

*The important columns are `merchant_rating_count`, `merchant_rating` and `merchant_has_profile_picture`.*

*Let's see how much they affect performance.*

In [None]:
average_ratings = df.groupby('performance')['merchant_rating_count'].mean()
average_ratings

In [None]:
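# Sample one colour per performance class from the 'summer' colormap.
# plt.get_cmap is used because matplotlib.cm.get_cmap is deprecated in recent Matplotlib releases
cmap = plt.get_cmap('summer', len(average_ratings))
colors = cmap(np.linspace(0, 1, len(average_ratings)))

average_ratings.plot(kind='bar', color = colors, figsize=(8, 6))

plt.title('Average Merchant Rating Count by Performance')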
plt.xlabel('Performance')
plt.ylabel('Average Merchant Rating Count')
plt.show()

*As expected, we see that products from merchants with a higher rating count perform best.*

In [None]:
df['merchant_rating'].min()

In [None]:
df['merchant_rating_range'] = df['merchant_rating'].apply(lambda x: '0-1' if x < 1 
                                           else '1-2' if x < 2 
                                           else '2-3' if x < 3
                                           else '3-4' if x < 4
                                           else '4-5')
df['merchant_rating_range'].value_counts()

*Again we see just 2 products in the 2-3 rating range, while all the others are rated higher. Let's display the results.*

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
df_grouped = df.groupby(['merchant_rating_range', 'performance']).size().reset_index(name='counts')
df_grouped = df_grouped[df_grouped['merchant_rating_range'] != '2-3']
sns.barplot(x='merchant_rating_range', y='counts', hue='performance', data=df_grouped, palette='Set2')
plt.ylim(0, 500)

plt.xlabel('Merchant Rating Range')
plt.ylabel('Count of Products')
plt.title('Product Performance by Merchant Rating Range')

# Show the plot
plt.show()

> Unlike the rating of individual products, a higher merchant rating matters significantly more for having successful and very successful products.

> Lastly, let's check if having a profile picture somewhat helps with performance

In [None]:
# Group by performance and has_profile_picture
pfp = df.groupby(['performance', 'merchant_has_profile_picture']).size().unstack(fill_value=0)

# Create subplots for pie charts
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

performance_classes = pfp.index  # Unique performance classes
for i, performance in enumerate(performance_classes):
    values = pfp.loc[performance]
    labels = ['No Profile Picture', 'Has Profile Picture']
    colors = ['#ff9999', '#66b3ff']  # Custom colors for the pie charts
    
    # Create pie chart
    axes[i].pie(values, labels=labels, autopct='%1.1f%%', startangle=90, colors=colors)
    axes[i].set_title(f'{performance.capitalize()}')

# Add a main title
plt.suptitle('Profile Picture Distribution by Performance Class', fontsize=16)
plt.tight_layout()
plt.show()


> We can see that the more successful the product is, the more likely the merchant is to have a profile picture. I wouldn't consider it a high-priority factor in achieving success, but it's still something that could help earn the customer's trust.
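
The same comparison can be read as a single number per class instead of three pie charts. A minimal sketch on invented rows, relying only on the fact that `merchant_has_profile_picture` is a 0/1 column:

```python
import pandas as pd

# Invented rows (not real data); 1 means the merchant has a profile picture
toy = pd.DataFrame({
    "performance": ["average"] * 4 + ["successful"] * 4 + ["very_successful"] * 2,
    "merchant_has_profile_picture": [0, 0, 0, 1, 0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the share of rows equal to 1
share_with_picture = (
    toy.groupby("performance")["merchant_has_profile_picture"].mean()
)

print(share_with_picture)
```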

## 2.4 Impact of Badges, Banners, and Shipping Options

> Lastly, let's look at the little things such as badges, banners and shipping options that could help in having a successful product

> This time, we will measure their influence using a correlation matrix

In [None]:

df_badges = df[['uses_ad_boosts','badges_count', 'badge_local_product', 'badge_product_quality',
                'shipping_is_express', 'countries_shipped_to','is_running_out',
                'units_sold','rating','merchant_rating']]
corr = df_badges.corr()
plt.figure(figsize=(7, 6))
sns.heatmap(corr, cmap ="RdBu",annot = True,fmt=".2f")

**Conclusions from the heatmap:**
* **Increasing Units Sold:** We see a weak positive correlation with rating and merchant rating, suggesting that improving customer reviews may help boost sales.
* **Improve Rating:** Rating correlates most with the *product quality badge*, so it is important to use high-quality materials to earn it. Merchant rating correlates more strongly than product rating with the *local product badge* and with the *number of countries the product is shipped to*. This suggests showcasing that the product's materials come from local suppliers, so if your merchant rating is low, these are the areas to focus on.
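
A single row of the heatmap is often easier to read as a sorted list. The sketch below does that on invented data. It uses Spearman rank correlation, which is an assumption made here because `units_sold` is bucketed; the heatmap above uses the pandas default (Pearson).

```python
import pandas as pd

# Invented rows (not real data)
toy = pd.DataFrame({
    "units_sold": [10, 100, 1000, 5000, 20000, 100000],
    "rating": [3.1, 3.5, 3.4, 4.0, 4.2, 4.5],
    "uses_ad_boosts": [1, 0, 1, 0, 1, 0],
})

# Take the units_sold column of the correlation matrix,
# drop its correlation with itself, and sort from strongest positive down
corr_with_units = (
    toy.corr(method="spearman")["units_sold"]
    .drop("units_sold")
    .sort_values(ascending=False)
)

print(corr_with_units)
```

A correlation only describes how two columns move together; it does not show that changing one causes the other to change.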

# 3. Predicting Product Performance

Using the information we got from our data exploration, let's take the features I considered most impactful on performance and train a **Logistic Regression** model to predict product performance across three classes: 1: Average, 2: Successful, 3: Very Successful.

In [None]:
#Dividing performance in categories 1 = Average | 2 = Successful | 3 = Very Successful
df['performance'] = df['performance'].apply(
    lambda x: 1 if x == 'average' else 2 if x == 'successful' else 3
)
df['performance'] = df['performance'].astype('int64')
df_reg = df[['price', 'retail_price','rating', 
             'rating_count', 'badge_local_product', 'badge_product_quality',
            'shipping_is_express','countries_shipped_to',
             'merchant_rating_count', 'merchant_rating','merchant_has_profile_picture','performance']]



In [None]:
df_reg.describe()

Here I import the libraries needed to create the model and do two things:
* Use a test size of 30%, because I feel like we don't have a lot of data and I don't want my model to overfit;
* Use a scaler, because we have features with large values like **rating_count** and small boolean-like features such as **shipping_is_express**.
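
To make the effect of the scaler concrete, here is a small sketch on invented numbers. `StandardScaler` subtracts each column's mean and divides by its standard deviation, so a large-valued column and a 0/1 column end up on the same scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented rows: [rating_count, shipping_is_express]
X = np.array([
    [10.0, 0.0],
    [200.0, 1.0],
    [5000.0, 0.0],
    [20000.0, 1.0],
])

scaler = StandardScaler()
# fit_transform learns the per-column mean and std, then standardizes
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```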

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

#pick features and target values
features = df_reg.drop(columns = 'performance')
target = df_reg['performance']
#split dataset in training and test
X_train, X_test, y_train, y_test = train_test_split(
    features,
    target,
    test_size = 0.3,
    random_state = 42
)
# Some features have very large values (such as rating_count) and others very
# small ones (such as shipping_is_express), so let's standardize them
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
# Only transform the test set: the mean/std must come from the training data,
# so that no information from the test set leaks into the model
X_test_scaled = scaler.transform(X_test)


After doing that, let's try to find the best parameters for our model. Again we do two things:
* Use **GridSearchCV** to choose the best value of C (the inverse of the regularization strength);
* Do this with cross-validation across 5 folds to see whether the model predicts consistently well and doesn't overfit.
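
A variation on this setup is to wrap the scaler and the model in a scikit-learn `Pipeline`, so that the scaler is refit on the training part of every fold. The sketch below is not the notebook's approach: it runs on synthetic data from `make_classification`, and the step name `logisticregression` is the one `make_pipeline` generates automatically.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the Wish features (not real data)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=42,
)

# The scaler is fit only on the training part of each CV fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# make_pipeline names each step after its lowercased class name
param_grid = {"logisticregression__C": [0.1, 1, 10, 50, 100]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```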

In [None]:
# Create the model and use GridSearchCV to find the best C (inverse regularization strength)
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 50,100]}

grid_search = GridSearchCV(LogisticRegression(random_state=0, max_iter=500), 
                           param_grid, 
                           cv=5, #5 fold cross validation to check for consistency
                           scoring='accuracy')

grid_search.fit(X_train_scaled, y_train)

print(f"Best Parameter: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")


**We get really high cross-validation accuracy, so let's check how the model does on the test set.**

In [None]:
#get the model
model = LogisticRegression(random_state=0, 
                           max_iter=500, 
                           C = 100)
model.fit(X_train_scaled,y_train)

#calculate prediction results
print(f"Train Results : {model.score(X_train_scaled,y_train):.4f}\n"
      f"Test Results: {model.score(X_test_scaled,y_test):.4f}")

**We get an amazing result of 90% accuracy on the test data. Let's finish by showing the Confusion Matrix, a table that measures the model's performance by comparing the predicted values to the actual values.**

In [None]:
#get the predictions in y_pred
y_pred = model.predict(X_test_scaled)

#See results via the confusion matrix
cm = confusion_matrix(y_test,y_pred)
plt.figure(figsize=(5, 3))
sns.heatmap(cm, cmap='Blues', annot=True, fmt='g',
           xticklabels = [1,2,3],
           yticklabels = [1,2,3])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

**Looking at the Confusion Matrix, we see our model is really good at distinguishing between categories 1 and 3, but struggles slightly to predict category 2 correctly, likely because it's the one in the middle.**
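
How much a class "struggles" can be put into a number with per-class recall, which is the diagonal of the confusion matrix divided by each row's total. The labels below are invented for illustration and are not the model's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels (not the model's real output)
y_true = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
y_pred = [1, 1, 1, 1, 2, 2, 1, 3, 3, 3]

# Rows are actual classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])

# Recall per class: correct predictions / all actual members of the class
recall = matrix.diagonal() / matrix.sum(axis=1)

for label, value in zip([1, 2, 3], recall):
    print(f"class {label}: recall = {value:.2f}")
```

In this toy example class 2 has the lowest recall, because two of its four members are predicted as a neighbouring class.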

# Conclusions

* ***Rating***: It has a good impact on units sold but isn't fundamental. To improve it, it's important to have **good quality** products that impress customers.
* ***Merchant Rating***: It's the factor that affects units sold the most. In 2.3 we saw that a merchant rating between 4 and 5 gives much better performance than one in the 3-4 range. At the end of 2.4 we found that the factors most associated with it are the **number of countries** the products are shipped to and the fact that the product is made using **local supplies**, so these are what you should look at to get a higher merchant rating.
* ***Units sold***: To summarize, having high ratings and a large number of reviews impacts units sold a lot. But if you want a very successful product, it's key to take care of even the smaller things, like using local supplies and high-quality materials to make the products.