===============================================================

Project Name: Monitors product sales and explores market insights of Tokopedia sales using web-scraping techniques.  
Author : Fazrin Muhammad 

===============================================================

# **A. Phase 1: Business Understanding**

## **Problem Statement**  
The goal is to increase income through product sales despite having no capital for production. Because the only available funds are allocated to promotion, dropshipping on the Tokopedia platform is the chosen scheme. The open question is which product to sell, with seblak, currently a viral trend, as the leading candidate. It is unclear how strong consumer interest in seblak actually is, so the decision calls for data-driven insight. Although data analysis skills are available, the main obstacle is the lack of comprehensive data: the only data at hand is what is publicly visible on the Tokopedia e-commerce platform.

## **SMART Business Understanding Framework**  
- **Specific:** Increasing revenue through selling seblak using the dropship scheme on the Tokopedia platform.
- **Measurable:** Measuring GMV values weekly/monthly and tracking the conversion from promotions to profits.
- **Achievable:** With the planned strategy, achieving a profit margin of up to 15% is a feasible goal.
- **Relevant:** Measuring GMV helps understand how much income is generated through the conversion process from promotions.
- **Time-bound:** The target must be achieved within the next 3 months to reach a profit margin of up to 15%.

# **B. Phase 2: Web Scraping**

## Performing web scraping on Tokopedia's page

In [12]:
# import Module
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pandas as pd

driver = webdriver.Chrome()

# Creating empty containers to store information from several required pages
nama_produk = []
harga_produk = []
penjual_produk = []
kota_toko = []
total_terjual = []
rating_produk = []

# Creating a loop to retrieve information from several required pages
for i in range(1,15):
    url = "https://www.tokopedia.com/search?navsource=&page={}&q=seblak&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=".format(i)
    driver.get(url)
    time.sleep(1)
    html = driver.page_source                                   # Save site's html into variable
    soup = BeautifulSoup(html, 'html.parser')                   # Parse the HTML string

    # Collecting product boxes
    boxes = soup.find_all('div', {'class':'css-1asz3by'})

    # Empty containers to store information per page
    list_nama = []
    list_harga = []
    list_penjual = []
    list_kota = []
    list_terjual = []
    list_rating = []

    # Creating a loop to retrieve each product box per page
    for box in boxes:
        # box.find() returns None when an element is missing, so calling
        # get_text() on it raises AttributeError; store None in that case.

        # Retrieving product names
        try:
            nama = box.find('div', {'data-testid':'spnSRPProdName'})
            list_nama.append(nama.get_text())
        except AttributeError:
            list_nama.append(None)

        # Retrieving product prices
        try:
            harga = box.find('div', {'data-testid':'spnSRPProdPrice'})
            list_harga.append(harga.get_text())
        except AttributeError:
            list_harga.append(None)

        # Retrieving seller names
        try:
            penjual = box.find('span', {'class':'prd_link-shop-name css-1kdc32b flip'})
            list_penjual.append(penjual.get_text())
        except AttributeError:
            list_penjual.append(None)

        # Retrieving city names
        try:
            kota = box.find('span', {'class':'prd_link-shop-loc css-1kdc32b flip'})
            list_kota.append(kota.get_text())
        except AttributeError:
            list_kota.append(None)

        # Retrieving the number of products sold
        try:
            terjual = box.find('span', {'class':'prd_label-integrity css-1sgek4h'})
            list_terjual.append(terjual.get_text())
        except AttributeError:
            list_terjual.append(None)

        # Retrieving the product rating
        try:
            rating = box.find('span', {'class':'prd_rating-average-text css-t70v7i'})
            list_rating.append(rating.get_text())
        except AttributeError:
            list_rating.append(None)

    # Removing the first 3 lines
    list_nama = list_nama[3:]
    list_harga = list_harga[3:]
    list_penjual = list_penjual[3:]
    list_kota = list_kota[3:]
    list_terjual = list_terjual[3:]
    list_rating = list_rating[3:]

    # Adding page information to the container
    nama_produk += list_nama
    harga_produk += list_harga
    penjual_produk += list_penjual
    kota_toko += list_kota
    total_terjual += list_terjual
    rating_produk += list_rating

driver.quit()

Description:  
The program above extracts key information from Tokopedia search result pages for the keyword 'seblak': product name, price, seller name, store location, total sold, and product rating. The workflow started by fetching page 1 only and inspecting the result for irregularities, after which retrieval was extended through page 14. On page 1, the first 3 entries turned out to contain missing data, which is attributed to promotional content on the Tokopedia page. The first 3 entries of every page are therefore removed.
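The fixed slice of three entries per page assumes that promotional boxes always occupy exactly the first three positions. A minimal sketch of an alternative is shown here, assuming that promotional boxes are the ones for which no product name could be extracted (an assumption, not something the scraper above verifies). The `records` list and the `drop_promotional` helper are hypothetical names used only for illustration.

```python
def drop_promotional(records):
    """Keep only records that have a product name.

    Assumption: promotional boxes yield None for 'nama'.
    """
    return [r for r in records if r.get("nama") is not None]


# Illustrative page: two promotional placeholders followed by one real product
records = [
    {"nama": None, "harga": None},
    {"nama": None, "harga": None},
    {"nama": "Kylafood Seblak Cup", "harga": "Rp11.000"},
]
kept = drop_promotional(records)
print(len(kept))
```

Filtering on content rather than position would keep working if the number of promotional boxes differs between pages, at the cost of also dropping any genuine product whose name failed to parse.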

In [13]:
# Checking the number of entries retrieved for each column
print(len(nama_produk))
print(len(harga_produk))
print(len(penjual_produk))
print(len(kota_toko))
print(len(total_terjual))
print(len(rating_produk))

484
484
484
484
484
484


Description:  
Every column contains the same number of entries: 484.

In [14]:
# Create Data Frame
df = pd.DataFrame({
    'nama': nama_produk,
    'harga':harga_produk,
    'penjual':penjual_produk,
    'kota':kota_toko,
    'total_terjual':total_terjual,
    'rating':rating_produk})

df

Unnamed: 0,nama,harga,penjual,kota,total_terjual,rating
0,Seblak Instan Ceu Nthien Khas Bandung Rasana N...,Rp17.000,Central Seblak Nusantara,Tangerang Selatan,2rb+ terjual,4.9
1,Kylafood Seblak Cup,Rp11.000,Kylafood Jakarta,Jakarta Selatan,100+ terjual,5.0
2,Kylafood Seblak Rempah Autentik,Rp10.000,Kylafood Jakarta,Jakarta Selatan,100+ terjual,5.0
3,3 BASO ACI FREE SEBLAK SAJODO SNACK & FOOD,Rp63.000,Sajodo Snack & Food,Tasikmalaya,40+ terjual,5.0
4,Seblak instant murah,Rp4.900,dapurbulbin,Bandung,,
...,...,...,...,...,...,...
479,SEBRING KRUPUK KERUPUK SEBLAK KERING PEDAS DAU...,Rp16.000,Aydaa Snack,Surakarta,90+ terjual,4.9
480,Gelifood Seblak Instan Kerupuk Mawar Bumbu Ken...,Rp15.000,Lidigeli,Kab. Garut,250+ terjual,4.8
481,Kerupuk Seblak Rafael Pedas / Seblak Mawar Ped...,Rp16.000,DUO BOCIL SNACK,Depok,100+ terjual,5.0
482,seblak basreng pedas original exstra daun jeru...,Rp25.000,Seblak putra bandung,Kab. Bandung,250+ terjual,4.8


Description:  
A DataFrame (df) is created with the columns nama (name), harga (price), penjual (seller), kota (city), total_terjual (total sold), and rating.

In [17]:
# Saving the retrieved data to a .csv file
df.to_csv('rawdata.csv', index=False) 

Description:  
The previously extracted data is saved in .csv format. This is done to anticipate any changes to the retrieved data due to Tokopedia's system. This dataset is referred to as raw data.

# **C. Phase 3: Data Preparation**

## Loading raw data

In [None]:
# import pandas
import pandas as pd

# Loading raw data
df = pd.read_csv('rawdata.csv')
df


Unnamed: 0,nama,harga,penjual,kota,total_terjual,rating
0,Seblak Instan Ceu Nthien Khas Bandung Rasana N...,Rp17.000,Central Seblak Nusantara,Tangerang Selatan,2rb+ terjual,4.9
1,Kylafood Seblak Cup,Rp11.000,Kylafood Jakarta,Jakarta Selatan,100+ terjual,5.0
2,Kylafood Seblak Rempah Autentik,Rp10.000,Kylafood Jakarta,Jakarta Selatan,100+ terjual,5.0
3,3 BASO ACI FREE SEBLAK SAJODO SNACK & FOOD,Rp63.000,Sajodo Snack & Food,Tasikmalaya,40+ terjual,5.0
4,Seblak instant murah,Rp4.900,dapurbulbin,Bandung,,
...,...,...,...,...,...,...
479,SEBRING KRUPUK KERUPUK SEBLAK KERING PEDAS DAU...,Rp16.000,Aydaa Snack,Surakarta,90+ terjual,4.9
480,Gelifood Seblak Instan Kerupuk Mawar Bumbu Ken...,Rp15.000,Lidigeli,Kab. Garut,250+ terjual,4.8
481,Kerupuk Seblak Rafael Pedas / Seblak Mawar Ped...,Rp16.000,DUO BOCIL SNACK,Depok,100+ terjual,5.0
482,seblak basreng pedas original exstra daun jeru...,Rp25.000,Seblak putra bandung,Kab. Bandung,250+ terjual,4.8


In [2]:
# Show the first 15 rows of the data
df.head(15)

Unnamed: 0,nama,harga,penjual,kota,total_terjual,rating
0,Seblak Instan Ceu Nthien Khas Bandung Rasana N...,Rp17.000,Central Seblak Nusantara,Tangerang Selatan,2rb+ terjual,4.9
1,Kylafood Seblak Cup,Rp11.000,Kylafood Jakarta,Jakarta Selatan,100+ terjual,5.0
2,Kylafood Seblak Rempah Autentik,Rp10.000,Kylafood Jakarta,Jakarta Selatan,100+ terjual,5.0
3,3 BASO ACI FREE SEBLAK SAJODO SNACK & FOOD,Rp63.000,Sajodo Snack & Food,Tasikmalaya,40+ terjual,5.0
4,Seblak instant murah,Rp4.900,dapurbulbin,Bandung,,
5,"Seblak Rafael, Seblak Coet Instan Halal",Rp25.000,Brother Meat Shop,Depok,250+ terjual,5.0
6,"Kylafood Seblak Rempah Autentik 115gr, Seblak ...",Rp9.900,Brother Meat Shop,Depok,100+ terjual,4.9
7,Seblak Instan Pedas Home Made,Rp3.500,the Dhecip,Tangerang Selatan,3rb+ terjual,4.9
8,KERUPUK SEBLAK PEDAS DAUN JERUK 100 GR,Rp4.500,BociKakang,Jakarta Selatan,500+ terjual,4.8
9,Kylafood Seblak Cup,Rp11.000,Kylafood Jakarta,Jakarta Selatan,100+ terjual,5.0


Description:  
The table above shows the first 15 entries retrieved in the previous step. One product, 'Seblak instant murah', has missing values: its total_terjual (total sold) and rating columns are NaN. A check on the Tokopedia page shows that the product is still new, so it has no sales or ratings yet. In addition, the harga (price) and total_terjual columns need to be reformatted before they can be analyzed. The data type of each column and the missing values will be checked first.

In [3]:
# Check the data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 484 entries, 0 to 483
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   nama           484 non-null    object 
 1   harga          484 non-null    object 
 2   penjual        484 non-null    object 
 3   kota           484 non-null    object 
 4   total_terjual  475 non-null    object 
 5   rating         470 non-null    float64
dtypes: float64(1), object(5)
memory usage: 22.8+ KB


Description:  
The output above shows that the dataset has 484 entries and 6 columns: nama, harga, penjual, kota, total_terjual, and rating. The harga and total_terjual columns are stored as object (string) even though they hold numeric data, so they should be converted to an integer or float type. In addition, the non-null counts of total_terjual (475) and rating (470) are lower than the total number of entries, which is examined next.

In [4]:
# Count the missing values in each column
df.isna().sum()


nama              0
harga             0
penjual           0
kota              0
total_terjual     9
rating           14
dtype: int64

Description:  
The lower non-null counts found earlier are explained by missing values: 9 in total_terjual and 14 in rating. These missing values are handled in the data cleaning steps.

## Data Cleaning

### Converting data types and adjusting data format

In [5]:
# Adjusting data format and data type for the harga column
df["harga"] = df["harga"].str.replace("Rp", "", regex=False).str.replace(".", "", regex=False).astype(int)

In [6]:
# Adjusting data format for the total_terjual column
df["total_terjual"] = df["total_terjual"].str.replace("rb", "000", regex=False).str.replace("+", "", regex=False).str.replace(" terjual", "", regex=False)

In [7]:
# Check the data information
df.info()

# Check data
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 484 entries, 0 to 483
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   nama           484 non-null    object 
 1   harga          484 non-null    int32  
 2   penjual        484 non-null    object 
 3   kota           484 non-null    object 
 4   total_terjual  475 non-null    object 
 5   rating         470 non-null    float64
dtypes: float64(1), int32(1), object(4)
memory usage: 20.9+ KB


Unnamed: 0,nama,harga,penjual,kota,total_terjual,rating
0,Seblak Instan Ceu Nthien Khas Bandung Rasana N...,17000,Central Seblak Nusantara,Tangerang Selatan,2000,4.9
1,Kylafood Seblak Cup,11000,Kylafood Jakarta,Jakarta Selatan,100,5.0
2,Kylafood Seblak Rempah Autentik,10000,Kylafood Jakarta,Jakarta Selatan,100,5.0
3,3 BASO ACI FREE SEBLAK SAJODO SNACK & FOOD,63000,Sajodo Snack & Food,Tasikmalaya,40,5.0
4,Seblak instant murah,4900,dapurbulbin,Bandung,,
...,...,...,...,...,...,...
479,SEBRING KRUPUK KERUPUK SEBLAK KERING PEDAS DAU...,16000,Aydaa Snack,Surakarta,90,4.9
480,Gelifood Seblak Instan Kerupuk Mawar Bumbu Ken...,15000,Lidigeli,Kab. Garut,250,4.8
481,Kerupuk Seblak Rafael Pedas / Seblak Mawar Ped...,16000,DUO BOCIL SNACK,Depok,100,5.0
482,seblak basreng pedas original exstra daun jeru...,25000,Seblak putra bandung,Kab. Bandung,250,4.8


Description:  
The harga column has been converted successfully: it now has an integer data type and contains only digits. The values in total_terjual are now purely numeric as well, but the column still contains missing values. Its data type is not converted in the same step because casting a column with missing values to integer raises an error.
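The string replacements above can also be written as small parsing functions that make the expected label format explicit. The sketch below is a standalone illustration using only the standard library; `parse_price` and `parse_sold` are hypothetical helper names, and the sold-label pattern assumes whole-number labels such as '100+ terjual' or '2rb+ terjual' ('rb' abbreviating 'ribu', thousand), which are the forms visible in the sample rows. Any other format returns None instead of a silently wrong number.

```python
import re


def parse_price(text):
    """Convert a price label such as 'Rp17.000' to an integer number of rupiah."""
    if text is None:
        return None
    digits = re.sub(r"[^0-9]", "", text)  # drop 'Rp' and thousands separators
    return int(digits) if digits else None


def parse_sold(text):
    """Convert a label such as '2rb+ terjual' or '100+ terjual' to an integer."""
    if text is None:
        return None
    match = re.fullmatch(r"(\d+)(rb)?\+?\s*terjual", text.strip())
    if match is None:
        return None  # unexpected format: leave it missing rather than guess
    value = int(match.group(1))
    return value * 1000 if match.group(2) else value


print(parse_price("Rp17.000"), parse_sold("2rb+ terjual"), parse_sold("100+ terjual"))
```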

### Handling missing value

In [8]:
# Replacing NaN values with 0
df = df.fillna(0)

# Count the missing values in each column
df.isna().sum()

nama             0
harga            0
penjual          0
kota             0
total_terjual    0
rating           0
dtype: int64

Description:  
The exploration above found NaN values in the total_terjual and rating columns. A check on the Tokopedia page shows that these belong to new stores that have not made any sales or received ratings yet. The missing values are therefore filled with 0.
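Filling an unrated product with 0 treats it as if it had received the lowest possible score, which affects any summary statistic computed on the rating column afterwards. The sketch below illustrates the size of that effect on a handful of made-up ratings (the values are illustrative, not taken from the dataset) by comparing the mean with zeros filled in against the mean of rated products only.

```python
from statistics import mean

ratings = [4.9, 5.0, 5.0, None, 4.8]  # illustrative values, None = not yet rated

# Option used in the notebook: unrated products count as 0
filled = [r if r is not None else 0.0 for r in ratings]

# Alternative: summarize only products that actually have a rating
rated_only = [r for r in ratings if r is not None]

mean_filled = mean(filled)
mean_rated = mean(rated_only)
print(round(mean_filled, 2), round(mean_rated, 2))
```

For total_terjual the fill value of 0 is a natural reading of "no sales yet"; for rating, which of the two summaries is appropriate depends on whether unrated products should count against the average.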

In [9]:
# Change the data type of the total_terjual column to integer
df["total_terjual"] = df["total_terjual"].astype(int)

df.info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 484 entries, 0 to 483
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   nama           484 non-null    object 
 1   harga          484 non-null    int32  
 2   penjual        484 non-null    object 
 3   kota           484 non-null    object 
 4   total_terjual  484 non-null    int32  
 5   rating         484 non-null    float64
dtypes: float64(1), int32(2), object(3)
memory usage: 19.0+ KB


Unnamed: 0,nama,harga,penjual,kota,total_terjual,rating
0,Seblak Instan Ceu Nthien Khas Bandung Rasana N...,17000,Central Seblak Nusantara,Tangerang Selatan,2000,4.9
1,Kylafood Seblak Cup,11000,Kylafood Jakarta,Jakarta Selatan,100,5.0
2,Kylafood Seblak Rempah Autentik,10000,Kylafood Jakarta,Jakarta Selatan,100,5.0
3,3 BASO ACI FREE SEBLAK SAJODO SNACK & FOOD,63000,Sajodo Snack & Food,Tasikmalaya,40,5.0
4,Seblak instant murah,4900,dapurbulbin,Bandung,0,0.0
...,...,...,...,...,...,...
479,SEBRING KRUPUK KERUPUK SEBLAK KERING PEDAS DAU...,16000,Aydaa Snack,Surakarta,90,4.9
480,Gelifood Seblak Instan Kerupuk Mawar Bumbu Ken...,15000,Lidigeli,Kab. Garut,250,4.8
481,Kerupuk Seblak Rafael Pedas / Seblak Mawar Ped...,16000,DUO BOCIL SNACK,Depok,100,5.0
482,seblak basreng pedas original exstra daun jeru...,25000,Seblak putra bandung,Kab. Bandung,250,4.8


Description:  
There are no missing values anymore, and the data type of each column is now appropriate.

### Handling Duplicate Data

In [10]:
# Drop duplicate data
df_cleaned = df.drop_duplicates(subset=['nama'], keep='first')
df_cleaned

Unnamed: 0,nama,harga,penjual,kota,total_terjual,rating
0,Seblak Instan Ceu Nthien Khas Bandung Rasana N...,17000,Central Seblak Nusantara,Tangerang Selatan,2000,4.9
1,Kylafood Seblak Cup,11000,Kylafood Jakarta,Jakarta Selatan,100,5.0
2,Kylafood Seblak Rempah Autentik,10000,Kylafood Jakarta,Jakarta Selatan,100,5.0
3,3 BASO ACI FREE SEBLAK SAJODO SNACK & FOOD,63000,Sajodo Snack & Food,Tasikmalaya,40,5.0
4,Seblak instant murah,4900,dapurbulbin,Bandung,0,0.0
...,...,...,...,...,...,...
454,baso aci / seblak instan,6000,Nyum_mama,Pekanbaru,40,4.6
455,CUANKIE SIOMAY ISI 50 TOPPING BASO ACI / SEBLAK,18000,Pusat Cuankie & Cemilan Frozen,Jakarta Timur,29,4.5
456,150 Pcs CUANKI SIOMAY Kering Topping Pelengkap...,34500,Endo Shop ID,Kab. Garut,30,5.0
457,Seblak Bapper Instan Hot Spicy/Seblak Bapper P...,16900,V3 OnSHop,Surabaya,21,4.9


In [11]:
# Tidying up the index after removing duplicate data
df_cleaned = df_cleaned.reset_index(drop=True)
df_cleaned
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121 entries, 0 to 120
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   nama           121 non-null    object 
 1   harga          121 non-null    int32  
 2   penjual        121 non-null    object 
 3   kota           121 non-null    object 
 4   total_terjual  121 non-null    int32  
 5   rating         121 non-null    float64
dtypes: float64(1), int32(2), object(3)
memory usage: 4.9+ KB


Description:  
To remove duplicates, rows that share the same nama (product name) are dropped and only the first occurrence is kept. This leaves a total of 121 entries.
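Deduplicating on nama alone also removes listings from different sellers that happen to use the same product name. A minimal sketch of the difference, using made-up rows (the second seller and its price are illustrative), compares the notebook's key with a combined name-and-seller key:

```python
import pandas as pd

# Illustrative rows: the same product name offered by two different sellers,
# plus one exact repeat of the first listing
sample = pd.DataFrame({
    "nama": ["Kylafood Seblak Cup", "Kylafood Seblak Cup", "Kylafood Seblak Cup"],
    "penjual": ["Kylafood Jakarta", "Brother Meat Shop", "Kylafood Jakarta"],
    "harga": [11000, 11500, 11000],
})

# Key used in the notebook: product name only
by_name = sample.drop_duplicates(subset=["nama"], keep="first")

# Alternative key: product name and seller together
by_name_and_seller = sample.drop_duplicates(subset=["nama", "penjual"], keep="first")

print(len(by_name), len(by_name_and_seller))
```

Which key is preferable depends on whether the same product sold by two stores should count as one market entry or two.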

In [12]:
# Saving the cleaned data to a .csv file
df_cleaned.to_csv('cleandata.csv', index=False) 

# **D. Phase 4: Data Analysis**

## Importing SciPy and NumPy and Loading Data

In [13]:
# import the scipy and numpy libraries
import scipy.stats as stats
from scipy.stats import ttest_ind
from scipy.stats import spearmanr
import numpy as np

In [14]:
# Loading the cleaned data
df1 = pd.read_csv('cleandata.csv')
df1
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121 entries, 0 to 120
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   nama           121 non-null    object 
 1   harga          121 non-null    int64  
 2   penjual        121 non-null    object 
 3   kota           121 non-null    object 
 4   total_terjual  121 non-null    int64  
 5   rating         121 non-null    float64
dtypes: float64(1), int64(2), object(3)
memory usage: 5.8+ KB


## Analyzing the mean, median, standard deviation, skewness, and kurtosis of the harga, total_terjual, and rating columns

### harga column

In [16]:
# Calculate the mean, median, and standard deviation
harga_mean = df1['harga'].mean()
harga_median = df1['harga'].median()
harga_std = df1['harga'].std()

# Calculate skewness and kurtosis
harga_skew = df1['harga'].skew()
harga_kurto = df1['harga'].kurtosis()

print(f'mean: {harga_mean}')
print(f'median: {harga_median}')
print(f'standard deviasi: {harga_std}')
print(f'skewness: {harga_skew}')
print(f'kurtosis: {harga_kurto}')

mean: 24824.421487603307
median: 15000.0
standard deviasi: 33956.66216536211
skewness: 3.775126295846377
kurtosis: 17.191033849150266


Description:  
Based on the output above, the average selling price is Rp 24,824 and the median is Rp 15,000. The standard deviation of Rp 33,957 exceeds the mean, which indicates large price variability. The skewness of 3.78 indicates a positive (right) skew: most prices cluster at the low end while a long tail extends to the right, so a small number of high-priced products pulls the mean above the median. The kurtosis of 17.19 (pandas reports excess kurtosis, for which a normal distribution scores 0) indicates much heavier tails than a normal distribution (leptokurtic), consistent with extreme price values in the data.
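The gap between mean and median in a right-skewed column can be reproduced on a tiny example. The sketch below uses made-up prices (illustrative only, with one deliberately extreme value) to show how a single expensive item moves the mean far more than the median.

```python
from statistics import mean, median

# Illustrative prices in rupiah; the last value is a deliberate outlier
prices = [4900, 10000, 11000, 15000, 17000, 25000, 250000]

price_mean = mean(prices)
price_median = median(prices)

# The same data without the outlier, for comparison
mean_without_outlier = mean(prices[:-1])

print(round(price_mean), price_median, round(mean_without_outlier))
```

This is why the median is usually the more representative "typical price" for strongly skewed data such as this one.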

### total_terjual column

In [17]:
# Calculate the mean, median, and standard deviation
terjual_mean = df1['total_terjual'].mean()
terjual_median = df1['total_terjual'].median()
terjual_std = df1['total_terjual'].std()

# Calculate skewness and kurtosis
terjual_skew = df1['total_terjual'].skew()
terjual_kurto = df1['total_terjual'].kurtosis()

print(f'mean: {terjual_mean}')
print(f'median: {terjual_median}')
print(f'standard deviasi: {terjual_std}')
print(f'skewness: {terjual_skew}')
print(f'kurtosis: {terjual_kurto}')

mean: 595.8099173553719
median: 100.0
standard deviasi: 1832.393936512423
skewness: 4.2393591021262695
kurtosis: 17.939872532084586


Description:  
Based on the output above, the average of the total_terjual column is 596 units sold per product and the median is 100. The standard deviation of 1,832 exceeds the mean, which indicates large variability in sales. The skewness of 4.24 indicates a positive (right) skew: most products have low sales while a long tail extends to the right, so a few products with very high sales pull the mean above the median. The kurtosis of 17.94 indicates much heavier tails than a normal distribution (leptokurtic), consistent with a few extremely large sales values in the data.

### rating column

In [18]:
# Calculate the mean, median, and standard deviation
rating_mean = df1['rating'].mean()
rating_median = df1['rating'].median()
rating_std = df1['rating'].std()

# Calculate skewness and kurtosis
rating_skew = df1['rating'].skew()
rating_kurto = df1['rating'].kurtosis()

print(f'mean: {rating_mean}')
print(f'median: {rating_median}')
print(f'standard deviasi: {rating_std}')
print(f'skewness: {rating_skew}')
print(f'kurtosis: {rating_kurto}')

mean: 4.471074380165289
median: 4.9
standard deviasi: 1.357475476545152
skewness: -2.9952760157615335
kurtosis: 7.245773058707465


Description:  
Based on the output above, the average product rating is 4.47 on a scale of 0-5, indicating that the products are, overall, rated very highly. The median is 4.9. The standard deviation of 1.36 is small relative to the mean, so the ratings vary comparatively little. The skewness of -3.00 indicates a negative (left) skew: most ratings cluster near the top of the scale while a tail extends to the left. This tail comes from a small number of very low values, likely the unrated products that were filled with 0 during data cleaning. The kurtosis of 7.25 indicates heavier tails than a normal distribution (leptokurtic), consistent with those few very low ratings.

## Does selling seblak have good potential? Let's calculate the minimum and maximum income potential that can be obtained from seblak sales in each store.

In [19]:
# Add an income column for each product (price x units sold)
df1['pendapatan'] = df1['total_terjual'] * df1['harga']

# Check table
df1

Unnamed: 0,nama,harga,penjual,kota,total_terjual,rating,pendapatan
0,Seblak Instan Ceu Nthien Khas Bandung Rasana N...,17000,Central Seblak Nusantara,Tangerang Selatan,2000,4.9,34000000
1,Kylafood Seblak Cup,11000,Kylafood Jakarta,Jakarta Selatan,100,5.0,1100000
2,Kylafood Seblak Rempah Autentik,10000,Kylafood Jakarta,Jakarta Selatan,100,5.0,1000000
3,3 BASO ACI FREE SEBLAK SAJODO SNACK & FOOD,63000,Sajodo Snack & Food,Tasikmalaya,40,5.0,2520000
4,Seblak instant murah,4900,dapurbulbin,Bandung,0,0.0,0
...,...,...,...,...,...,...,...
116,baso aci / seblak instan,6000,Nyum_mama,Pekanbaru,40,4.6,240000
117,CUANKIE SIOMAY ISI 50 TOPPING BASO ACI / SEBLAK,18000,Pusat Cuankie & Cemilan Frozen,Jakarta Timur,29,4.5,522000
118,150 Pcs CUANKI SIOMAY Kering Topping Pelengkap...,34500,Endo Shop ID,Kab. Garut,30,5.0,1035000
119,Seblak Bapper Instan Hot Spicy/Seblak Bapper P...,16900,V3 OnSHop,Surabaya,21,4.9,354900


In [20]:
# Calculating the average income value
pendapatan_mean = df1['pendapatan'].mean()

# Calculating the standard deviation of income
pendapatan_std = df1['pendapatan'].std()

# Calculating the total number of data points
pendapatan_n = len(df1['pendapatan'])

# Show the calculation results
print(f'mean: {pendapatan_mean}')
print(f'std deviasi: {pendapatan_std}')
print(f'jumlah data pendapatan: {pendapatan_n}')

mean: 7266444.900826447
std deviasi: 21253493.234401066
jumlah data pendapatan: 121


In [21]:
# Calculating the 95% confidence interval for the mean income (normal approximation)
min_value, max_value = stats.norm.interval(0.95, loc=pendapatan_mean, scale=pendapatan_std/np.sqrt(pendapatan_n))

# Show the calculation results
print(f'pendapatan minimum: {min_value}')
print(f'pendapatan maximum: {max_value}')

pendapatan minimum: 3479528.420363556
pendapatan maximum: 11053361.381289337


Description:  
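Based on the calculation above, the 95% confidence interval for the mean income per product runs from Rp 3,479,528 (lower bound) to Rp 11,053,361 (upper bound). Here income is the listed price multiplied by the total units sold, so it is a gross figure rather than profit. Even so, the income from selling seblak is quite tempting, especially when we sell the product without any capital as a dropshipper.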
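The interval can be reproduced by hand from the three numbers printed earlier, which makes the formula explicit: mean ± z × (standard deviation / √n). The sketch below uses only the standard library and the mean, standard deviation, and count reported in the notebook output; `standard_error`, `z`, `lower`, and `upper` are names introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Values printed by the notebook for the pendapatan column
pendapatan_mean = 7266444.900826447
pendapatan_std = 21253493.234401066
pendapatan_n = 121

standard_error = pendapatan_std / sqrt(pendapatan_n)
z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value

lower = pendapatan_mean - z * standard_error
upper = pendapatan_mean + z * standard_error

print(round(lower), round(upper))
```

Because the standard error shrinks with √n, this interval describes the uncertainty of the average over all 121 products, not the range of income a single product can be expected to earn.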

## To assess market demand, is there a difference in prices between the Jabodetabek area and outside the Jabodetabek area? This could be due to potential differences in the cost of raw materials obtained.

To compare prices between the Jabodetabek area and the area outside Jabodetabek, the hypotheses are:  
- H0 = There is no significant difference between prices in the Jabodetabek area and outside the Jabodetabek area.
- H1 = There is a significant difference between prices in the Jabodetabek area and outside the Jabodetabek area.  

The test uses a 95% confidence level (alpha = 0.05).  
Because the comparison is between the mean prices of two groups, and the Jabodetabek and non-Jabodetabek data are independent of each other, an independent t-test is used.

The first step is to look at the store locations from the table we have.

In [22]:
# Check the store locations
df1['kota'].unique()

array(['Tangerang Selatan', 'Jakarta Selatan', 'Tasikmalaya', 'Bandung',
       'Depok', 'Jakarta Timur', 'Jakarta Barat', 'Kab. Bogor',
       'Surakarta', 'Surabaya', 'Kab. Garut', 'Kab. Bandung',
       'Jakarta Utara', 'Cimahi', 'Kab. Bekasi', 'Kab. Tangerang',
       'Pangkal Pinang', 'Tangerang', 'Kab. Sumedang', 'Semarang',
       'Kab. Bandung Barat', 'Medan', 'Jakarta Pusat', 'Bekasi',
       'Kab. Sukabumi', 'Bogor', 'Kab. Nganjuk', 'Kab. Purworejo',
       'Pekanbaru'], dtype=object)

Next, separate the cities in the Jabodetabek area (Jakarta, Bogor, Depok, Tangerang, Bekasi) from those outside the Jabodetabek area.

In [23]:
# Create a wilayah (region) column to separate Jabodetabek from areas outside Jabodetabek.
# Note: 'Jakarta Utara' is not in this list, so its rows are labeled 'Luar Jabodetabek'.
jabodetabek = ['Jakarta Barat', 'Jakarta Selatan', 'Jakarta Timur', 'Jakarta Pusat',
               'Bogor', 'Kab. Bogor', 'Depok', 'Tangerang', 'Tangerang Selatan',
               'Kab. Tangerang', 'Bekasi', 'Kab. Bekasi']
df1['wilayah'] = df1['kota'].apply(lambda x: 'Jabodetabek' if x in jabodetabek else 'Luar Jabodetabek')

# Check table
df1

Unnamed: 0,nama,harga,penjual,kota,total_terjual,rating,pendapatan,wilayah
0,Seblak Instan Ceu Nthien Khas Bandung Rasana N...,17000,Central Seblak Nusantara,Tangerang Selatan,2000,4.9,34000000,Jabodetabek
1,Kylafood Seblak Cup,11000,Kylafood Jakarta,Jakarta Selatan,100,5.0,1100000,Jabodetabek
2,Kylafood Seblak Rempah Autentik,10000,Kylafood Jakarta,Jakarta Selatan,100,5.0,1000000,Jabodetabek
3,3 BASO ACI FREE SEBLAK SAJODO SNACK & FOOD,63000,Sajodo Snack & Food,Tasikmalaya,40,5.0,2520000,Luar Jabodetabek
4,Seblak instant murah,4900,dapurbulbin,Bandung,0,0.0,0,Luar Jabodetabek
...,...,...,...,...,...,...,...,...
116,baso aci / seblak instan,6000,Nyum_mama,Pekanbaru,40,4.6,240000,Luar Jabodetabek
117,CUANKIE SIOMAY ISI 50 TOPPING BASO ACI / SEBLAK,18000,Pusat Cuankie & Cemilan Frozen,Jakarta Timur,29,4.5,522000,Jabodetabek
118,150 Pcs CUANKI SIOMAY Kering Topping Pelengkap...,34500,Endo Shop ID,Kab. Garut,30,5.0,1035000,Luar Jabodetabek
119,Seblak Bapper Instan Hot Spicy/Seblak Bapper P...,16900,V3 OnSHop,Surabaya,21,4.9,354900,Luar Jabodetabek
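The explicit city list used for the wilayah column does not contain 'Jakarta Utara', although that city appears in the output of `df1['kota'].unique()`. A sketch of a classification that is less sensitive to such omissions is shown below. It matches on the five area names instead of enumerating every city; `JABODETABEK_KEYWORDS` and `classify_region` are hypothetical names, and the approach assumes that no city outside Jabodetabek contains one of these words, which holds for the cities listed in this dataset.

```python
JABODETABEK_KEYWORDS = ("Jakarta", "Bogor", "Depok", "Tangerang", "Bekasi")


def classify_region(kota):
    """Label a city as 'Jabodetabek' or 'Luar Jabodetabek' by keyword match."""
    if any(keyword in kota for keyword in JABODETABEK_KEYWORDS):
        return "Jabodetabek"
    return "Luar Jabodetabek"


for kota in ["Jakarta Utara", "Kab. Bogor", "Tangerang Selatan", "Kab. Bandung"]:
    print(kota, "->", classify_region(kota))
```

With a DataFrame this could be applied as `df1['kota'].apply(classify_region)`. Doing so would move the 'Jakarta Utara' rows into the Jabodetabek group, so the group means and test results reported in this notebook may change.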


Next, perform an independent T-test comparative analysis.

In [24]:
# Create variables for the Jabodetabek and Outside Jabodetabek region columns, as well as the price column.
harga_jabodetabek = df1[df1['wilayah'] == 'Jabodetabek']['harga']
harga_luar_jabodetabek = df1[df1['wilayah'] == 'Luar Jabodetabek']['harga']

harga_jabodetabek_mean = harga_jabodetabek.mean()
harga_luar_jabodetabek_mean = harga_luar_jabodetabek.mean()

# Perform an independent T-test.
t_stat, p_value = ttest_ind(harga_jabodetabek, harga_luar_jabodetabek, equal_var=False)

# Display the t-test result and p-value
print(f't-test: {t_stat}')
print(f'p-value: {p_value}')
print(f'rata-rata harga jabodetabek: {harga_jabodetabek_mean}')
print(f'rata-rata harga luar jabodetabek: {harga_luar_jabodetabek_mean}')

t-test: -0.5945264821136899
p-value: 0.5535924142008497
rata-rata harga jabodetabek: 23201.35714285714
rata-rata harga luar jabodetabek: 27052.156862745098


Description:  
The t-statistic is -0.59. Its negative sign reflects that the mean price in Jabodetabek (Rp 23,201) is lower than outside Jabodetabek (Rp 27,052) in this sample. However, the p-value is 0.553, which is greater than alpha (0.05), so H0 is not rejected: the test finds no significant difference between prices in the two areas. For dropshipping, this suggests that suppliers can be chosen from either Jabodetabek or outside Jabodetabek.
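Since the test is called with `equal_var=False`, SciPy computes Welch's version of the t-test, which does not assume equal variances in the two groups. The sketch below shows the underlying formula with the standard library only, on two small made-up samples (illustrative values, not the price data); `welch_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof


group_a = [1, 2, 3, 4, 5]    # illustrative sample
group_b = [2, 4, 6, 8, 10]   # illustrative sample with larger spread
t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 4), round(dof, 4))
```

The sign of the statistic follows the order of the arguments: it is negative whenever the first group has the lower mean, regardless of whether the difference is significant.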

## Selling price is one of the factors buyers consider when purchasing seblak. However, do seblak buyers prefer cheaper seblak?

Let's perform a Spearman correlation analysis with the following hypotheses:
- H0 = There is no correlation between price (harga) and sales (total_terjual).
- H1 = There is a correlation between price (harga) and sales (total_terjual).

With a 95% confidence level and alpha of 0.05.

In [25]:
# Create a variable for price from the harga column and sales from the total_terjual column
harga = df1['harga']
penjualan = df1['total_terjual']

# Calculate the Spearman correlation and p-value
corr_rho, pval_s = stats.spearmanr(harga, penjualan)

# Display the Spearman correlation results and p-value
print(f'korelasi spearman: {corr_rho}')
print(f'p-value: {pval_s}')

korelasi spearman: -0.21956462791875084
p-value: 0.015531364180758818


Description:  
The Spearman correlation is -0.22, a weak negative correlation: products with higher prices tend to have lower sales, and vice versa. The p-value (0.016) is less than alpha (0.05), so H0 is rejected, meaning there is a statistically significant correlation between price and sales. Higher prices are therefore associated with lower sales, although the relationship is weak and a correlation alone does not show that price is the cause. If you want to start a seblak business, it is advisable to pay attention to the selling price. This analysis does not cover advertising or other factors that may also influence sales.
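Spearman's correlation is the Pearson correlation computed on ranks instead of raw values, which is why it suits skewed columns such as price and sales. The sketch below implements that definition with the standard library on a few made-up pairs (illustrative values, not the dataset); `average_ranks` and `spearman_rho` are hypothetical helper names.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


harga = [3500, 4500, 11000, 17000, 63000]   # illustrative prices
terjual = [3000, 500, 100, 2000, 40]        # illustrative units sold
rho = spearman_rho(harga, terjual)
print(round(rho, 2))
```

Because only the ordering matters, a single product with an extreme price or sales figure influences the coefficient no more than any other product.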

# **E. Phase 5: Conclusion**

After conducting a thorough analysis of seblak sales data through the Tokopedia platform, I drew some interesting and informative conclusions. Initially, the data collection process through web scraping encountered some challenges. Promotional ads appearing in the first three entries of each page posed an obstacle that needed to be overcome. However, with careful attention, I managed to overcome this issue and successfully collected data from 14 Tokopedia pages.

Subsequently, after the data was successfully collected, the next step was to prepare the data for further analysis. The raw data obtained required processing to make it easier to analyze. This process involved adjusting data types, handling missing values, and removing duplicate entries. After the data preparation process was completed, the remaining data totaled 121 entries, ready for further analysis.

One crucial aspect of the analysis was assessing the selling price of seblak. Although the average selling price reached Rp 24,824, it is important to note that there were significant outliers affecting this average. Therefore, the median price was considered more representative. The same applied to the analysis of seblak sales, where outliers affected the average value.

Furthermore, the data also indicated that the average product rating was 4.47 on a scale of 0-5. This indicates that seblak products receive positive feedback from customers, which is a good indication for business success.

After evaluating the potential income from seblak sales, I found with 95% confidence that the mean income per product lies between Rp 3,479,528 and Rp 11,053,361. This provides a clear picture of the income potential that can be obtained from this business.

Additionally, the analysis showed that there was no significant difference in product prices between the Jabodetabek and non-Jabodetabek areas. This provides flexibility in selecting suppliers, as prices do not significantly differ between regions.

From the correlation analysis between selling price and sales, it was found that higher selling prices are associated with lower sales volume (Spearman correlation of -0.22). The relationship is weak but statistically significant, so selling price deserves attention when setting up the business.

Lastly, based on the analysis to achieve a target profit of 15%, the recommended selling price was around Rp 17,250. With this strategy, I can target sales of 50 orders per week, with total revenue reaching Rp 10,350,000 in the next 3 months. With a 15% profit margin, the total income in 3 months could reach Rp 1,552,500.
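The figures in this target can be checked with a few lines of arithmetic. The sketch below takes the price, weekly order target, and margin from the paragraph above and assumes that 3 months are counted as 12 weeks, which is the only period that reproduces the stated revenue. The price of Rp 17,250 also equals the median price of Rp 15,000 plus 15%, which may be how it was derived, although the text does not say so.

```python
price = 17_250          # recommended selling price (Rp), from the text
orders_per_week = 50    # weekly sales target, from the text
weeks = 12              # assumption: 3 months counted as 12 weeks
margin = 0.15           # target profit margin, from the text

revenue = price * orders_per_week * weeks
profit = revenue * margin

print(revenue, round(profit))
```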