# A. Web Scraping

```python
# import required libraries 
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time

# prepare lists to store the extracted data
productName = [] # to store product names
productPrice = [] # to store product prices
seller = [] # to store seller names
city = [] # to store the city of the seller
unitSold = [] # to store the number of units sold
rating = [] # to store product ratings

# create the web driver
# this notebook uses Microsoft Edge, so we initialize the Edge web driver
driver = webdriver.Edge()

# loop through pages 2 to 12 (the stop value of range, 13, is exclusive)
for page in range(2, 13):
    # load the webpage for the current page number
    driver.get(f"https://www.tokopedia.com/search?navsource=&page={page}&q=seblak&search_id=202412110649080230A2C77D199C23EDP5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=")

    # pause for 2 seconds to allow the page to load completely
    time.sleep(2)

    # get the page's HTML content
    html = driver.page_source

    # parse the HTML content using the BeautifulSoup package
    # (named `soup` so it does not overwrite the `page` loop variable)
    soup = BeautifulSoup(html, "html.parser")

    # find all product-related elements on the page
    rows = soup.find_all("div", {"class": "WABnq4pXOYQihv0hUfQwOg=="})

    # loop through each product element to extract details
    for row in rows:
        # extract specific details for each product
        namaProduk = row.find("span", {"class": "_0T8-iGxMpV6NEsYEhwkqEg=="})
        hargaProduk = row.find("div", {"class": "_67d6E1xDKIzw+i2D2L0tjw=="})
        penjual = row.find("span", {"class": "T0rpy-LEwYNQifsgB-3SQw=="})
        kotaToko = row.find("span", {"class": "pC8DMVkBZGW7-egObcWMFQ== flip"})
        banyakTerjual = row.find("span", {"class": "se8WAnkjbVXZNA8mT+Veuw=="})
        ratingProduk = row.find("span", {"class": "_9jWGz3C-GX7Myq-32zWG9w=="})

        # check whether each detail exists: if it does, extract its text content, otherwise append None
        if namaProduk is not None:
            # matching content to the condition
            # .strip() removes leading and trailing whitespace from the text
            content = namaProduk.get_text().strip()

            # if available, append to list
            productName.append(content)
        else:
            productName.append(None)

        if hargaProduk is not None:
            productPrice.append(hargaProduk.get_text().strip())
        else:
            productPrice.append(None)

        if penjual is not None:
            seller.append(penjual.get_text().strip())
        else:
            seller.append(None)

        if kotaToko is not None:
            city.append(kotaToko.get_text().strip())
        else:
            city.append(None)

        if banyakTerjual is not None:
            unitSold.append(banyakTerjual.get_text().strip())
        else:
            unitSold.append(None)

        if ratingProduk is not None:
            rating.append(ratingProduk.get_text().strip())
        else:
            rating.append(None)

# close the browser once all pages have been scraped
driver.quit()

# create a DataFrame to organize the extracted data
df = pd.DataFrame()

# populate the DataFrame with extracted data
df['productName'] = productName
df['productPrice'] = productPrice
df['seller'] = seller
df['city'] = city
df['unitSold'] = unitSold
df['rating'] = rating

# save the extracted data to a CSV file
df.to_csv('result.csv', index=False)

```
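The six `if`/`else` blocks in the scraper all follow the same pattern, so they could be collapsed into one small helper. The sketch below is only an illustration: `text_or_none` is a hypothetical name, and the HTML snippet and its class names (`card`, `name`, `price`) are made up for the example rather than taken from Tokopedia.

```python
from bs4 import BeautifulSoup


def text_or_none(element):
    """Return the stripped text of a parsed element, or None if it is missing."""
    return element.get_text().strip() if element is not None else None


# hypothetical HTML standing in for one product card
sample_html = '<div class="card"><span class="name">  Seblak Instan  </span></div>'
card = BeautifulSoup(sample_html, "html.parser").find("div", {"class": "card"})

# the name exists, so its whitespace-stripped text is returned
name = text_or_none(card.find("span", {"class": "name"}))

# the price element does not exist, so the helper returns None
price = text_or_none(card.find("span", {"class": "price"}))

print(name, price)
```

With a helper like this, each field in the loop could become a single line such as `productName.append(text_or_none(namaProduk))`.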

# B. Data Preparation

## 1. Data Exploration

### 1.a. Showing Data

In [608]:
# import pandas for data exploration

import pandas as pd

In [609]:
# read the csv file

df = pd.read_csv('result.csv')

df.head()

Unnamed: 0,productName,productPrice,seller,city,unitSold,rating
0,Kerupuk Seblak Pedas 200gram,Rp19.000,,,100+ terjual,4.8
1,MAKARONI ASIN GURIH SPIRAL 2KG - 1 kg,Rp65.000,,,500+ terjual,5.0
2,BASRENG STIK PEDAS DAUN JERUK 200GR,Rp21.137,,,1rb+ terjual,4.9
3,Rasa Juara Seblak Original Instan | Seblak Jua...,Rp14.200,Wu Meyers Official Shop,Surabaya,18 terjual,5.0
4,Paket Isi 6 Cuanki Instan Lakoca | Latagor | S...,Rp93.840,Lakoca Official Shop,Cimahi,50+ terjual,4.9


The first few rows above show that some columns have structural problems. For instance, the values in the productPrice column are strings that start with a currency symbol ("Rp") and use "." as a thousands separator. To analyze this dataset, we need to address this during the cleaning stage.

Like productPrice, the unitSold column also needs cleaning. Units sold should be stored as integers rather than as objects.

### 1.b. Show Dataframe Information

In [610]:
# using .info to gather dataset structure

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   productName   140 non-null    object 
 1   productPrice  140 non-null    object 
 2   seller        110 non-null    object 
 3   city          110 non-null    object 
 4   unitSold      129 non-null    object 
 5   rating        115 non-null    float64
dtypes: float64(1), object(5)
memory usage: 6.7+ KB


This dataset contains 140 rows and 6 columns with information about seblak products. It includes:

- Product Name and Price (fully populated)
- Seller and City (30 missing entries each)
- Units Sold (11 missing entries) and Rating (25 missing entries)
- Mostly text (object) columns and one numerical column (Rating)

Several columns are also stored with the wrong data type and need to be converted. The columns we will convert at a later stage are:

- productPrice (to be converted to float)
- unitSold (to be converted to integer)

### 1.c. Check Missing Values

In [611]:
# check missing values/null count using .info() method
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   productName   140 non-null    object 
 1   productPrice  140 non-null    object 
 2   seller        110 non-null    object 
 3   city          110 non-null    object 
 4   unitSold      129 non-null    object 
 5   rating        115 non-null    float64
dtypes: float64(1), object(5)
memory usage: 6.7+ KB


The summary shows that the dataset has a substantial amount of missing data, most of it in the seller, city, and rating columns, which may affect later quantitative analyses. Out of 140 entries:

- seller has 30 missing values
- city has 30 missing values
- unitSold has 11 missing values
- rating has 25 missing values
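The counts above were derived by subtracting the non-null counts from the total number of rows. A more direct option is `isna().sum()`. The sketch below runs on a small made-up frame (`sample` is hypothetical, not the real `result.csv`) to show the idea.

```python
import pandas as pd

# hypothetical sample mimicking three of the scraped columns
sample = pd.DataFrame({
    "seller": ["Toko A", None, "Toko C"],
    "city": ["Bandung", None, "Depok"],
    "rating": [4.9, None, None],
})

# number of missing values per column
missing_count = sample.isna().sum()

# share of missing values per column, in percent
missing_pct = (sample.isna().mean() * 100).round(1)

print(pd.DataFrame({"missing": missing_count, "percent": missing_pct}))
```

On the real dataframe, the same two lines applied to `df` would report the missing counts per column without manual subtraction.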


## 2. Data Cleaning

In [612]:
# using .info method to gather summary once again
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   productName   140 non-null    object 
 1   productPrice  140 non-null    object 
 2   seller        110 non-null    object 
 3   city          110 non-null    object 
 4   unitSold      129 non-null    object 
 5   rating        115 non-null    float64
dtypes: float64(1), object(5)
memory usage: 6.7+ KB


In [613]:
# Set option to display all rows
pd.set_option('display.max_rows', None)

df

Unnamed: 0,productName,productPrice,seller,city,unitSold,rating
0,Kerupuk Seblak Pedas 200gram,Rp19.000,,,100+ terjual,4.8
1,MAKARONI ASIN GURIH SPIRAL 2KG - 1 kg,Rp65.000,,,500+ terjual,5.0
2,BASRENG STIK PEDAS DAUN JERUK 200GR,Rp21.137,,,1rb+ terjual,4.9
3,Rasa Juara Seblak Original Instan | Seblak Jua...,Rp14.200,Wu Meyers Official Shop,Surabaya,18 terjual,5.0
4,Paket Isi 6 Cuanki Instan Lakoca | Latagor | S...,Rp93.840,Lakoca Official Shop,Cimahi,50+ terjual,4.9
5,kerupuk seblak pedas/kerupuk jablai 100gram,Rp8.000,Warung Snackk,Tangerang,16 terjual,5.0
6,Kerupuk Bawang Prima / Kerupuk Warna Warni / K...,Rp11.750,food Service Ingredients,Jakarta Timur,2 terjual,5.0
7,Baso urat djuara sami raos/Baso Aci Urat /Seblak,Rp28.000,Snack Zone Official,Jakarta Selatan,19 terjual,5.0
8,seblak kering RAFAEL mawar pedas daun jeruk 1K...,Rp32.000,Mega Snack 095,Cimahi,14 terjual,5.0
9,Seblak Instan Pedas Home Made,Rp3.500,the Dhecip,Tangerang Selatan,4rb+ terjual,4.9


When analyzing the unique values, we observed certain keywords unrelated to seblak, such as "Indomie" and "Tempat Seblak." While Indomie may feature a seblak-flavored variant, our analysis focuses solely on the performance of seblak in the market. Therefore, for the purposes of this analysis, these unrelated entries should be excluded.

In the following step, we perform simple data cleaning by lowercasing the product names with .lower() and dropping the unrelated rows by index. Converting the whole productName series to lowercase makes it easier to search for specific keywords.

In [614]:
# using .lower() method to lowercase values
df['productName'] = df['productName'].str.lower()

In [615]:
# drop the irrelevant entries identified above, by index label
df = df.drop([39,40,41,48,86], axis=0)
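Dropping by hard-coded index only works for this exact scrape, since the row positions change whenever the data is re-collected. As an alternative, the rows could be filtered by keyword. The sketch below is an assumption-laden illustration: the `sample` frame is made up, and the keyword list contains only the two examples named in the text, so it may not reproduce the five rows dropped above.

```python
import re

import pandas as pd

# hypothetical product names, already lowercased
sample = pd.DataFrame({
    "productName": [
        "seblak instan pedas",
        "indomie rasa seblak",
        "tempat seblak stainless",
        "kerupuk seblak mawar",
    ]
})

# keywords mentioned in the text as unrelated to seblak products
unrelated_keywords = ["indomie", "tempat seblak"]

# build one regex that matches any keyword; re.escape guards special characters
pattern = "|".join(re.escape(keyword) for keyword in unrelated_keywords)

# flag rows whose name contains a keyword, then keep the rest
is_unrelated = sample["productName"].str.contains(pattern, case=False, na=False)
filtered = sample[~is_unrelated].reset_index(drop=True)

print(filtered)
```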


In [616]:
# showing the first five rows to check whether the change has been made
df.head()

Unnamed: 0,productName,productPrice,seller,city,unitSold,rating
0,kerupuk seblak pedas 200gram,Rp19.000,,,100+ terjual,4.8
1,makaroni asin gurih spiral 2kg - 1 kg,Rp65.000,,,500+ terjual,5.0
2,basreng stik pedas daun jeruk 200gr,Rp21.137,,,1rb+ terjual,4.9
3,rasa juara seblak original instan | seblak jua...,Rp14.200,Wu Meyers Official Shop,Surabaya,18 terjual,5.0
4,paket isi 6 cuanki instan lakoca | latagor | s...,Rp93.840,Lakoca Official Shop,Cimahi,50+ terjual,4.9


In [617]:
# re-read the dataframe structure
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 135 entries, 0 to 139
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   productName   135 non-null    object 
 1   productPrice  135 non-null    object 
 2   seller        108 non-null    object 
 3   city          108 non-null    object 
 4   unitSold      126 non-null    object 
 5   rating        115 non-null    float64
dtypes: float64(1), object(5)
memory usage: 7.4+ KB


After this first cleaning step, we are left with 135 entries, which still need further cleaning. Next, we clean the productPrice series. It contains characters (the "Rp" prefix and the "." thousands separator) that would break the type conversion later on. We remove these characters and then convert the column to float.

In [618]:
# remove the literal strings '.' and 'Rp'
# regex=False treats '.' as a plain character instead of a regex wildcard
df['productPrice'] = df['productPrice'].str.replace('.', '', regex=False).str.replace('Rp', '', regex=False)

In [619]:
# re-check the first few rows to ensure if those characters have been taken out
df.head()

Unnamed: 0,productName,productPrice,seller,city,unitSold,rating
0,kerupuk seblak pedas 200gram,19000,,,100+ terjual,4.8
1,makaroni asin gurih spiral 2kg - 1 kg,65000,,,500+ terjual,5.0
2,basreng stik pedas daun jeruk 200gr,21137,,,1rb+ terjual,4.9
3,rasa juara seblak original instan | seblak jua...,14200,Wu Meyers Official Shop,Surabaya,18 terjual,5.0
4,paket isi 6 cuanki instan lakoca | latagor | s...,93840,Lakoca Official Shop,Cimahi,50+ terjual,4.9


In [620]:
# convert productPrice into float
df['productPrice'] = df['productPrice'].astype(float)

In [621]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 135 entries, 0 to 139
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   productName   135 non-null    object 
 1   productPrice  135 non-null    float64
 2   seller        108 non-null    object 
 3   city          108 non-null    object 
 4   unitSold      126 non-null    object 
 5   rating        115 non-null    float64
dtypes: float64(2), object(4)
memory usage: 7.4+ KB


After observing the seller and city columns, we noticed that they have the same number of missing values, suggesting that the two columns are missing together. In addition, upon reviewing the original Tokopedia page, we found that the entries without seller and city are advertisements and do not carry actual seller or city information.

We decided to exclude these rows because the next part of the analysis compares product performance between Jabodetabek and other regions. Including them could distort the results.

In [622]:
# drop rows where the 'seller' column has NaN values
df = df.dropna(subset=['seller'])

# resurface the dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 108 entries, 3 to 139
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   productName   108 non-null    object 
 1   productPrice  108 non-null    float64
 2   seller        108 non-null    object 
 3   city          108 non-null    object 
 4   unitSold      99 non-null     object 
 5   rating        99 non-null     float64
dtypes: float64(2), object(4)
memory usage: 5.9+ KB


The unitSold and rating columns also have missing values. Since we plan to include rating in calculations such as the mean, median, and other statistical summaries, these NaN values need to be handled.

We decided to remove rows with NaN in rating or unitSold for the following reasons:

- To keep the statistical results accurate without relying on imputation; and
- We have no reliable way to estimate the missing values in this case.

In [623]:
# drop rows where the 'unitSold' column has NaN values
df = df.dropna(subset=['unitSold'])

# drop rows where the 'rating' column has NaN values
df = df.dropna(subset=['rating'])

# resurface the dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 98 entries, 3 to 138
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   productName   98 non-null     object 
 1   productPrice  98 non-null     float64
 2   seller        98 non-null     object 
 3   city          98 non-null     object 
 4   unitSold      98 non-null     object 
 5   rating        98 non-null     float64
dtypes: float64(2), object(4)
memory usage: 5.4+ KB


At this point, unitSold is still stored as object and contains characters that prevent conversion to an integer type.

In this step, we remove those characters so that only numeric values remain, then convert the series to an integer data type.


In [624]:
# list the unique values to see which characters need to be removed
df['unitSold'].unique()

array(['18 terjual', '50+ terjual', '16 terjual', '2 terjual',
       '19 terjual', '14 terjual', '4rb+ terjual', '30+ terjual',
       '250+ terjual', '1 terjual', '24 terjual', '2rb+ terjual',
       '100+ terjual', '80+ terjual', '500+ terjual', '60+ terjual',
       '70+ terjual', '90+ terjual', '5 terjual', '1rb+ terjual',
       '21 terjual', '6 terjual', '10 terjual', '4 terjual', '15 terjual',
       '22 terjual', '23 terjual', '26 terjual', '7 terjual',
       '29 terjual', '13 terjual'], dtype=object)

In [625]:
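# remove '+', 'terjual' and spaces, and expand 'rb' (thousand) into '000'
# regex=False treats '+' as a plain character instead of a regex quantifier
df['unitSold'] = df['unitSold'].str.replace('+', '', regex=False).str.replace('rb', '000', regex=False).str.replace('terjual', '', regex=False).str.replace(' ', '', regex=False)

# convert data type into int
df['unitSold'] = df['unitSold'].astype(int)

# show the dataframe structure to confirm the new data type
df.info()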

<class 'pandas.core.frame.DataFrame'>
Index: 98 entries, 3 to 138
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   productName   98 non-null     object 
 1   productPrice  98 non-null     float64
 2   seller        98 non-null     object 
 3   city          98 non-null     object 
 4   unitSold      98 non-null     int64  
 5   rating        98 non-null     float64
dtypes: float64(2), int64(1), object(3)
memory usage: 5.4+ KB
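The chained replacements above can also be expressed as a small parsing function, which makes the handling of the "rb" (thousand) suffix explicit. This is a sketch rather than the method used above: `parse_units_sold` is a hypothetical helper, and it assumes every label follows the "number, optional rb, optional +, terjual" shape seen in the unique values.

```python
import re

import pandas as pd


def parse_units_sold(label):
    """Convert a label such as '4rb+ terjual' into an integer (4000).

    Labels with '+' are lower bounds on the site, so the result is a floor.
    """
    match = re.match(r"\s*(\d+)\s*(rb)?", label)
    if match is None:
        return None
    number = int(match.group(1))
    # 'rb' is short for 'ribu' (thousand)
    return number * 1000 if match.group(2) else number


# labels copied from the unique values shown above
labels = pd.Series(["18 terjual", "50+ terjual", "4rb+ terjual", "1rb+ terjual"])
units = labels.map(parse_units_sold)

print(units.tolist())
```

Either approach treats "50+" as 50 and "4rb+" as 4000, so the resulting unit counts understate the true sales for those listings.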


Next, we check for duplicate entries. Even though the site we are scraping is an e-commerce platform, it may show the same item more than once across result pages. Therefore, it's best to remove the duplicates.

In [626]:
# check the sum of duplicate values

df_duplicate = df.duplicated().sum()
df_duplicate

np.int64(13)

Since 13 duplicate rows were found, we remove them. In the following step, we use the drop_duplicates method to drop all repeated rows.

In [627]:
df = df.drop_duplicates(ignore_index=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   productName   85 non-null     object 
 1   productPrice  85 non-null     float64
 2   seller        85 non-null     object 
 3   city          85 non-null     object 
 4   unitSold      85 non-null     int64  
 5   rating        85 non-null     float64
dtypes: float64(2), int64(1), object(3)
memory usage: 4.1+ KB


Now, we are left with 85 clean entries ready to be analyzed.

# C. Business Understanding and Problem Statement

We use the SMART framework to define and break down the business problem we want to answer.

### **SPECIFIC**

Increase monthly income by evaluating optimum sales potential of the trending seblak product.


### **MEASURABLE**

Generate additional income equal to 50% of my current salary.

### **ACHIEVABLE**

By leveraging sales data from Tokopedia and applying the data analysis skills I've learned in bootcamp, such as Python and data visualization, I can find the range of potential income I can get.

### **RELEVANT**

Make a well-informed decision about selecting seblak as the product for my dropshipping business. With limited promotional resources, it is crucial to ensure that I invest in a product with high sales potential.

### **TIME-BOUND**

Within 30 days, from the web scraping process through to insight finding, I can launch my dropshipping business and generate additional income.

## Problem Statement

Generate an additional stream of income within 30 days, worth as much as 50% of my current salary, by applying the skills learned in the bootcamp.

# D. Analysis

## 1. Mean, Median, Std, Skewness, and Kurtosis

To understand the characteristics and trends of seblak products, we conducted a statistical analysis on three key columns: productPrice, unitSold, and rating. The calculations include the mean, median, standard deviation, skewness, and kurtosis for each column. Here's the story behind these metrics and the insights they reveal:

**Mean and Median**: The mean indicates the typical price, units sold, or rating of seblak products, while the median provides a measure less affected by extreme values.

**Standard Deviation**: A high standard deviation would imply that prices, sales number, or rating vary greatly across products, potentially reflecting a range of quality or portion sizes.

**Skewness and Kurtosis**: Positive skewness would indicate a concentration in the lower values with a few high value ones driving the tail. High kurtosis would highlight the presence of extreme outliers, such as premium seblak.
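The five metrics can be collected with one small helper instead of repeating the same calls per column. The sketch below uses a made-up price series (`sample_prices`) and a hypothetical `summarize` function. Note that pandas reports the sample standard deviation and excess kurtosis, so a normal distribution would score about 0 on kurtosis.

```python
import pandas as pd


def summarize(series):
    """Return mean, median, std, skewness and excess kurtosis of a numeric series."""
    return pd.Series({
        "mean": series.mean(),
        "median": series.median(),
        "std": series.std(),            # sample standard deviation (ddof=1)
        "skewness": series.skew(),
        "kurtosis": series.kurtosis(),  # excess kurtosis: normal distribution is about 0
    })


# hypothetical prices with one expensive outlier
sample_prices = pd.Series([10_000, 12_000, 15_000, 18_000, 95_000])
summary = summarize(sample_prices)

print(summary)
```

In this toy series the single expensive item pulls the mean well above the median and produces positive skewness, which is the same pattern discussed for the real columns.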

The details of our statistical analysis will be shown below:



#### Statistical Analysis for Product's Price

In [628]:
# find the central tendency mean
print('Product Price in Average is Rp:',df['productPrice'].mean())

# find the central tendency median
print('Median of Product Price is Rp:',df['productPrice'].median())

# define a new variable for standard deviation
std = df['productPrice'].std()

# print standard deviation variable
print('Standard Deviation of product price is Rp:',std)

# find the skewness in the distribution
print(f"Skewness: {df['productPrice'].skew()}")

# find the kurtosis in the distribution
print(f"Kurtosis: {df['productPrice'].kurtosis()}")

Product Price in Average is Rp: 21739.49411764706
Median of Product Price is Rp: 14999.0
Standard Deviation of product price is Rp: 20232.459359384156
Skewness: 2.4332894175338535
Kurtosis: 7.289255193435832


Analysis:

Based on the above, we can surmise that:

- The average price of all 85 products in the dataset is around Rp21,739. This is noticeably higher than the median price (Rp14,999), by roughly Rp6,700. This indicates the presence of outliers in our dataset, since the mean is strongly influenced by how many outliers there are and how far they lie from the rest.

- The median price is Rp14,999, which means that if we were to sort all of the seblak prices from lowest to highest, the middle point would be Rp14,999. The median price is more reliable in this case, since our dataset appears to contain several high outliers.

- The standard deviation ("std") represents how widely the prices are spread around the mean value (Rp21,739). Since the std of the product price column is about Rp20,232, our prices vary widely: one std above the mean is around Rp42,000 and one std below is around Rp1,500. This also suggests that the mean price is not a good representation of what most of our products cost.

- The skewness of about 2.43 means that our data is highly positively skewed. This means that there are outliers in the more expensive range, and that if it were represented in a graph, it would have a long right tail.

- The kurtosis of about 7.29 means that the distribution is leptokurtic, that is, it has heavier tails and a higher peak than a normal distribution. This is also consistent with the presence of extreme outliers.

Conclusion:

Based on all of the statistical analysis measures, we can conclude that product price is not normally distributed, and has extreme outliers in the more expensive range (positive skew) of the series.


#### Statistical Analysis for unitSold

In [629]:
# find the central tendency mean
print('Unit Sold in Average is:',df['unitSold'].mean(),'items')

# find the central tendency median
print('Median of Unit Sold is:',df['unitSold'].median(),'items')

# define a new variable for standard deviation
std = df['unitSold'].std()

# print standard deviation variable
print('Standard Deviation of units sold is:',std)

# find the skewness in the distribution
print(f"Skewness: {df['unitSold'].skew()}")

# find the kurtosis in the distribution
print(f"Kurtosis: {df['unitSold'].kurtosis()}")

Unit Sold in Average is: 345.32941176470587 items
Median of Unit Sold is: 60.0 items
Standard Deviation of units sold is: 815.5100418089997
Skewness: 3.5832608876055705
Kurtosis: 13.049139391291117


Analysis:

Based on the above, we can surmise that:

- The average units sold across all items in the dataset is around 345 units. This is far higher than the median (60 units), by roughly 285 units. This indicates a strong presence of outliers in our dataset, since the mean is strongly influenced by how many outliers there are and how far they lie from the rest.

- The median of units sold is 60, which means that if we were to sort all of the units sold values from lowest to highest, the middle point would be 60. The median units sold is more reliable in this case, since our dataset appears to contain several high outliers.

- The standard deviation ("std") represents how widely the units sold are spread around the mean value (345). Since the std of the units sold column is about 816, the items sold vary very widely and can be far higher or far lower than the average. This also suggests that the mean units sold is not a good representation of how well seblak typically sells.

- The skewness of about 3.58 means that our data is highly positively skewed. This means that there are outliers in the high-selling range, and that if it were represented in a graph, it would have a long right tail.

- The kurtosis of about 13.05 means that the distribution is leptokurtic, that is, it has heavier tails and a higher peak than a normal distribution. This is also consistent with the presence of extreme outliers.

Conclusion:

Based on all of the statistical measures, we can conclude that the units sold series is not normally distributed, and has extreme outliers in the upper range (positive skew) of the series.


#### Statistical Analysis for Rating

In [630]:
# find the central tendency mean
print('Rating in Average is:',df['rating'].mean(),'stars')

# find the central tendency median
print('Median of Rating is:',df['rating'].median(),'stars')

# define a new variable for standard deviation
std = df['rating'].std()

# print standard deviation variable
print('Standard Deviation of rating is',std)

# find the skewness in the distribution
print(f"Skewness: {df['rating'].skew()}")

# find the kurtosis in the distribution
print(f"Kurtosis: {df['rating'].kurtosis()}")

Rating in Average is: 4.876470588235295 stars
Median of Rating is: 4.9 stars
Standard Deviation of rating is 0.21802526001727932
Skewness: -3.466784554221262
Kurtosis: 13.570275661261157


Analysis:

Based on the above, we can surmise that:

- The average rating in the dataset is around 4.88. This is slightly lower than the median rating (4.9), which indicates that a few low ratings are pulling the mean down, since the mean is strongly influenced by outliers.

- The median rating is 4.9, which means that if we were to sort all of the rating values from lowest to highest, the middle point would be 4.9. The median rating is more reliable in this case, since our dataset appears to contain several outliers.

- The standard deviation ("std") represents how widely the ratings are spread around the mean value (4.88). Since the std of the rating column is about 0.22, most ratings sit close to the mean, typically within about 0.2 stars above or below it.

- The skewness of about -3.47 means that our data is highly negatively skewed. This means that there are outliers among the lower ratings, and that if it were represented in a graph, it would have a long left tail.

- The kurtosis of about 13.57 means that the distribution is leptokurtic, that is, it has heavier tails and a higher peak than a normal distribution. This is also consistent with the presence of extreme outliers.

Conclusion:

Based on all of the statistical analysis measures, we can conclude that rating series data is not normally distributed, and has extreme outliers in the more negative range (negative skew) of the series.


## 2. Confidence Interval

This section estimates the range within which the mean monthly income from selling seblak is likely to fall, by computing the lower and upper bounds of a confidence interval.


First, we assume that the data follow a normal distribution. Under the assumption that the data are not skewed to the left or right, we can use a normal-based confidence interval. Otherwise, another method would be needed for a more accurate estimate.

Second, we assume that the dataset reflects one month of seblak sales performance.

In [631]:
# import packages
import numpy as np
from scipy import stats

In [632]:
# define a new variable to contain the multiplication of unitSold and productPrice
# the new variable defined as our potential income
potentialIncome = df['unitSold'] * df['productPrice']

# calculate standard deviation from potential income
# this is necessary to calculate confidence interval
std = potentialIncome.std()

# create new variable to count the total number of rows
N = len(df)

# calculate confidence interval using variables created
# assigned confidence level is 95%
low, up = stats.norm.interval(0.95,loc=potentialIncome.mean(),scale=std/np.sqrt(N))

print('Lower Limit:',low)
print('Upper Limit:',up)


Lower Limit: 2684361.520364147
Upper Limit: 7850209.820812324


The confidence interval above suggests that, should we run our seblak dropshipping business, the true mean of potential income likely lies between roughly IDR 2,684,361 and IDR 7,850,209, at a 95% confidence level.

Note that this is different from taking the MIN and MAX of potential income directly. Those two values only capture the most extreme products in the dataset, which would skew the analysis. The confidence interval instead describes the range in which the average potential income per product is expected to fall, given the sample mean, the sample standard deviation, and the number of products.
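To make the calculation less of a black box: a normal-based 95% interval is the sample mean plus or minus about 1.96 standard errors, where the standard error is the sample standard deviation divided by the square root of the sample size. The sketch below checks this by hand against `stats.norm.interval`. The `income` values are made up for illustration and are not the scraped data.

```python
import numpy as np
from scipy import stats

# hypothetical potential income per product (price * units sold), in rupiah
income = np.array([250_000, 480_000, 1_200_000, 3_500_000, 900_000, 15_000_000])

n = len(income)
mean = income.mean()

# standard error of the mean, using the sample standard deviation (ddof=1)
sem = income.std(ddof=1) / np.sqrt(n)

# z-value that leaves 2.5% in each tail, roughly 1.96
z = stats.norm.ppf(0.975)

# manual interval: mean +/- z * standard error
manual_low, manual_up = mean - z * sem, mean + z * sem

# the same interval from scipy
scipy_low, scipy_up = stats.norm.interval(0.95, loc=mean, scale=sem)

print(manual_low, manual_up)
print(scipy_low, scipy_up)
```

Both pairs of bounds agree, which shows that the interval is centered on the mean and narrows as the number of products grows.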

## 3. Hypothesis Test

In [633]:
# find the unique values of city to filter out Jabodetabek
df['city'].unique()

array(['Surabaya', 'Cimahi', 'Tangerang', 'Jakarta Timur',
       'Jakarta Selatan', 'Tangerang Selatan', 'Kab.Ciamis',
       'Kab. Bandung', 'Kab. Garut', 'Bandung', 'Kab. Tangerang',
       'Denpasar', 'Jakarta Barat', 'Kab. Majalengka', 'Jakarta Pusat',
       'Kab. Cianjur', 'Jakarta Utara', 'Kab. Bogor', 'Kab. Malang',
       'Bekasi', 'Semarang', 'Medan', 'Bogor', 'Makassar', 'Kab. Bekasi',
       'Malang', 'Depok'], dtype=object)

In [None]:
# create a list for the cities in Jabodetabek area
jabodetabek = ['Tangerang', 'Jakarta Timur', 'Jakarta Selatan', 'Tangerang Selatan',
               'Kab. Tangerang', 'Jakarta Barat', 'Jakarta Pusat','Jakarta Utara',
               'Kab. Bogor', 'Bekasi', 'Bogor', 'Kab. Bekasi', 'Depok']

# create a new series and bucket city into two categories: Jabodetabek and Non-Jabodetabek
df['cityCategory'] = df['city'].apply(lambda x:
                                "Jabodetabek" if x in jabodetabek else
                                "Luar Jabodetabek")

# create new variable to contain the list of product price in Jabodetabek
priceJabodetabek = df[df['cityCategory'] == "Jabodetabek"]['productPrice']

# create new variable to contain the list of product price in the outside of Jabodetabek area
priceNonJabodetabek = df[df['cityCategory'] == "Luar Jabodetabek"]['productPrice']


The hypothesis for this case:

- H0 = There is no significant difference between the mean price in Jabodetabek and outside of Jabodetabek

- H1 = There is a significant difference between the mean price in Jabodetabek and outside of Jabodetabek

In [645]:
# calculate t-statistic and p-value
t_stat, p_val = stats.ttest_ind(priceJabodetabek,priceNonJabodetabek)
print(f'T-Statistic: {t_stat}')
print(f'P-value: {p_val}')

T-Statistic: -2.105580946859908
P-value: 0.03826303604617727


Based on the p-value (about 0.038), which is below the significance level (alpha) of 0.05, we reject H0 and conclude that there is a significant difference between prices in Jabodetabek and outside of Jabodetabek.

In addition, the negative t-statistic shows that prices in Jabodetabek are, on average, lower than those outside Jabodetabek.
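By default, `stats.ttest_ind` assumes that both groups have equal variances. If that assumption is in doubt, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic data. The group names, means, and spreads are invented for illustration, so the printed values say nothing about the real Jabodetabek comparison.

```python
import numpy as np
from scipy import stats

# synthetic prices for two groups with different spreads (not the scraped data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=18_000, scale=8_000, size=40)
group_b = rng.normal(loc=25_000, scale=20_000, size=45)

# Student's t-test: assumes equal variances (the default used above)
student = stats.ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.4f}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.4f}")
```

If both versions lead to the same decision at alpha 0.05, the conclusion does not hinge on the equal-variance assumption.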

## 4. Correlation between Price and Rating

In the following step, we aim to examine whether price has any effect on customer satisfaction. To do this, we use two indicators: price and rating. We chose ratings over units sold because ratings directly reflect customer feedback, making them a more reliable measure of satisfaction.

In contrast, units sold can be influenced by external factors like discounts or brand popularity, which do not always correlate with customer satisfaction. By focusing on ratings, we aim to gain a more accurate understanding of the relationship between price and customer satisfaction.

For the correlation analysis, we use Spearman correlation rather than Pearson or Kendall. Spearman is a good fit here because product price and rating, as shown in the previous section, are not normally distributed.

In [646]:
# find rho-correlation and p-value using spearman
corr_rho, pval_s = stats.spearmanr(df['rating'], df['productPrice'])

print(f"rho-correlation: {corr_rho:.2f}")
print(f"p-value: {pval_s}")

rho-correlation: -0.07
p-value: 0.548482250522859


A p-value above 0.05 means the observed correlation is not statistically significant, that is, it could plausibly have arisen by chance. Here the correlation coefficient (rho = -0.07) is very close to zero and the p-value is about 0.55, so we find no evidence of a meaningful relationship between product price and rating. In other words, a cheap seblak does not necessarily have a high rating, or vice versa.

# Conclusion

In the Business Understanding and Problem Statement stage, we identified the primary goal of this analysis: _to generate income as much as 50% of current salary by evaluating the sales potential of seblak._

From our confidence interval calculation, we conclude that if we proceed with a seblak dropshipping business, the lower bound of the estimated mean potential income is approximately IDR 2,684,361, and the upper bound is around IDR 7,850,209. These figures assume that the dataset we scraped from Tokopedia represents one month of seblak sales.

To put this into perspective, if the current Provincial Minimum Wage (UMP) is IDR 5.9 million per month, the lower bound of potential income from dropshipping seblak would amount to roughly 45% of the UMP.

Furthermore, we also evaluated the potential of seblak dropshipping based on customer satisfaction, measured by product ratings. Our analysis shows an average rating of 4.88 stars, with a median of 4.9 stars, out of a maximum of 5.

Both findings lead us to a conclusion: **dropshipping seblak could be a viable option to generate another 50% of my income. The consistently high ratings, which indicate strong customer satisfaction, further support the view that seblak could be a promising dropshipping product.**