
===============================================================================================  

Web Scraping Analysis: Product Seblak from Tokopedia

By __Angger Rizky Firdaus__  
This project is developed to accomplish Graded Challenge 3 of the FTDS Hacktiv8 program.  
Performing web scraping on Tokopedia's e-commerce platform for seblak products to analyze the seblak market on the platform.


===============================================================================================  



## __Background__

The initial objective of this project is to conduct data analysis on seblak products available on Tokopedia. This analysis aims to provide insights into the seblak market trends and patterns on Tokopedia before initiating the dropshipping business. What I would do is do:
a. webscrapping
b. Data preparation
c. Business Understanding (SMART framework)
d. Analysis

In [2]:
# List of libraries used
from scipy import stats
import numpy as np
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
import requests

np.random.seed(10)

## __Web Scrapping__

Scraping the search results using the keyword "seblak" on the Tokopedia platform to gather information such as Product Name, Product Price, Seller, City of Store, Number of Items Sold, and Product Rating.

In [54]:
# Syntax to activate the webdriver that uses the browser used.
driver = webdriver.Edge()

In [55]:
# syntax to automatically open the desired link in a previously opened browser.
tokped = 'https://www.tokopedia.com/search?navsource=&page=1&q=seblak&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st='

driver.get(tokped)

In [56]:
# syntax to save the html contained in the website that is open at that time.
html = driver.page_source

The HTML that is saved is only what the website displays in real time.

In [None]:
# syntax to display html that has been stored in the previous html variable.
soup = BeautifulSoup(html,'html.parser')
print(soup.prettify())

The function of beautiful soup and prettify is to display the HTML/abstract code above neatly.

In [None]:
# # html for item names on Tokopedia.

# <div class="prd_link-product-name css-3um8ox" data-testid="spnSRPProdName">Gelifood Seblak Instan Kerupuk Mawar Bumbu Kencur Rafael Pedas Gurih</div>

In [None]:
# # html price of goods on Tokopedia.
# <div class="prd_link-product-price css-h66vau" data-testid="spnSRPProdPrice">Rp13.000</div>

In [None]:
# # html city of goods seller on Tokopedia.

# <span class="prd_link-shop-loc css-1kdc32b flip" data-testid="spnSRPProdTabShopLoc">Kab. Garut</span>

In [None]:
# # html seller of goods on Tokopedia.

# <span class="prd_link-shop-name css-1kdc32b flip" data-testid="">Lidigeli</span>

In [None]:
# # html item rating on Tokopedia.

# <span class="prd_rating-average-text css-t70v7i" data-testid="">4.8</span>

In [None]:
# # html number of goods sold on Tokopedia.
# <span class="prd_label-integrity css-1sgek4h" data-testid="">250+ terjual</span>

In [70]:
# Syntax for doing webscraping.

# empty list to store the required information
list_nama = []
list_harga = []
list_penjual = []
list_kota = []
list_sold = []
list_rating = []

# syntax for looping a website of 10 pages
for i in range(1,11):
    tokped_link = (f'https://www.tokopedia.com/search?navsource=&page={i}&q=seblak&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=')
    driver.get(tokped_link)
    
  # syntax for scrolling slowly to get all the desired data.
    total_height = int(driver.execute_script("return document.body.scrollHeight"))
    for i in range(1, total_height, 5):
        driver.execute_script("window.scrollTo(0, {});".format(i))
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    # syntax to search for the class and attributes of the desired html (desired section.)
    elements = soup.find_all('div', {'class':"css-jza1fo",
                                     'class' :"css-llwpbs"})
        
    #result of looping from var. elements are reduced by looping to search
    #name, price, seller, city, rating and quantity sold.                 
    for el in elements:
        try:
            nama = el.find('div', {'class':"prd_link-product-name css-3um8ox",
                                    'data-testid':"spnSRPProdName"})
            list_nama.append(nama.get_text())
        except:
            list_nama.append(None)

        try:
            harga = el.find('div', {'class':"prd_link-product-price css-h66vau",
                                    'data-testid':"spnSRPProdPrice"})
            list_harga.append(harga.get_text())
        except:
            list_harga.append(None)

        try:
            penjual = el.find('span', {'class':"prd_link-shop-name css-1kdc32b flip",
                                    'data-testid':""})
            list_penjual.append(penjual.get_text())
        except:
            list_penjual.append(None)

        try:
            kota = el.find('span', {'class':"prd_link-shop-loc css-1kdc32b flip",
                                    'data-testid':"spnSRPProdTabShopLoc"})
            list_kota.append(kota.get_text())
        except:
            list_kota.append(None)
        try:
            sold = el.find('span', {'class':"prd_label-integrity css-1sgek4h",
                                    'data-testid':""})
            list_sold.append(sold.get_text())
        except:
            list_sold.append(None)

        try:
            rating = el.find('span', {'class':"prd_rating-average-text css-t70v7i",
                                    'data-testid':""})
            list_rating.append(rating.get_text())
        except:
            list_rating.append(None)




In [71]:
# to calculate the amount of data obtained from the results of webscraping 10 pages.
print(len(list_nama))
print(len(list_harga))
print(len(list_penjual))
print(len(list_kota))
print(len(list_sold))
print(len(list_rating))

849
849
849
849
849
849


The data in each list must match each other so that it can be prepared into a dataframe table. so that the webscrapping syntax contains an except function for error data which can be saved with null/none data.

In [72]:
# Syntax for creating a dataframe/table for the webscraping results of the Tokopedia website.
tokped_df = pd.DataFrame({
    'nama_produk' : list_nama,
    'harga_produk' : list_harga,
    'penjual' : list_penjual,
    'lokasi_toko': list_kota,
    'jumlah_terjual' : list_sold,
    'rating_produk' : list_rating
})

In [73]:
#finished table.
tokped_df

Unnamed: 0,nama_produk,harga_produk,penjual,lokasi_toko,jumlah_terjual,rating_produk
0,Gelifood Seblak Instan Kerupuk Mawar Bumbu Ken...,Rp13.000,Lidigeli,Kab. Garut,250+ terjual,4.8
1,Kylafood Seblak Original Play,Rp15.050,kylafood,Bandung,1rb+ terjual,5.0
2,"Seblak Rafael, Seblak Coet Instan Halal",Rp25.000,Brother Meat Shop,Depok,100+ terjual,4.9
3,Seblak Instan Ceu Nthien Khas Bandung Rasana N...,Rp17.000,Central Seblak Nusantara,Tangerang Selatan,2rb+ terjual,4.9
4,KERUPUK SEBLAK MENTAH ANEKA WARNA,Rp11.500,jajanangarut19,Jakarta Selatan,250+ terjual,4.9
...,...,...,...,...,...,...
844,Seblak Instan Sehati,Rp13.000,Official Kartika Sari,Bandung,500+ terjual,5.0
845,Bumbu Seblak - Cairo Food,Rp25.500,Cairo Food,Jakarta Pusat,29 terjual,5.0
846,KERUPUK/KRUPUK SEBLAK RASA PEDAS TERLARIS ISI ...,Rp16.695,Toko kue Sumber Mas,Jakarta Timur,250+ terjual,4.9
847,Seblak Merah Pedas / Kerupuk Bawang Oleh Oleh ...,Rp9.000,The Little Snacks,Tangerang Selatan,500+ terjual,4.9


In [74]:
#storing dataframe table results in a CSV file.
tokped_df.to_csv('P0G3_Angger_tokped.csv', index=False)

## __Data Preparation__

Perform 2 tasks for data preparation:

1. Data Exploration:
   Display summaries of the data and write insights derived from it.

2. Data Cleaning:
   Cleanse columns containing unnecessary symbols or characters, handle missing values, and adjust data types for columns that do not match their values.

### Data Exploration

In [3]:
#display data in a csv file for pandas to read.
tokped = pd.read_csv('P0G3_Angger_tokped.csv')

In [4]:
# Seblak product data on the Tokopedia website.
tokped

Unnamed: 0,nama_produk,harga_produk,penjual,lokasi_toko,jumlah_terjual,rating_produk
0,Gelifood Seblak Instan Kerupuk Mawar Bumbu Ken...,Rp13.000,Lidigeli,Kab. Garut,250+ terjual,4.8
1,Kylafood Seblak Original Play,Rp15.050,kylafood,Bandung,1rb+ terjual,5.0
2,"Seblak Rafael, Seblak Coet Instan Halal",Rp25.000,Brother Meat Shop,Depok,100+ terjual,4.9
3,Seblak Instan Ceu Nthien Khas Bandung Rasana N...,Rp17.000,Central Seblak Nusantara,Tangerang Selatan,2rb+ terjual,4.9
4,KERUPUK SEBLAK MENTAH ANEKA WARNA,Rp11.500,jajanangarut19,Jakarta Selatan,250+ terjual,4.9
...,...,...,...,...,...,...
844,Seblak Instan Sehati,Rp13.000,Official Kartika Sari,Bandung,500+ terjual,5.0
845,Bumbu Seblak - Cairo Food,Rp25.500,Cairo Food,Jakarta Pusat,29 terjual,5.0
846,KERUPUK/KRUPUK SEBLAK RASA PEDAS TERLARIS ISI ...,Rp16.695,Toko kue Sumber Mas,Jakarta Timur,250+ terjual,4.9
847,Seblak Merah Pedas / Kerupuk Bawang Oleh Oleh ...,Rp9.000,The Little Snacks,Tangerang Selatan,500+ terjual,4.9


Table related to seblak product data on the Tokopedia website. The table contains product names, prices, sellers, store locations, number of items sold, and product ratings. The obtained data consists of 849 rows. The data has not been cleaned yet, so it is raw data that needs to be cleaned.

In [5]:
#syntax to display the top 5 data
tokped.head()

Unnamed: 0,nama_produk,harga_produk,penjual,lokasi_toko,jumlah_terjual,rating_produk
0,Gelifood Seblak Instan Kerupuk Mawar Bumbu Ken...,Rp13.000,Lidigeli,Kab. Garut,250+ terjual,4.8
1,Kylafood Seblak Original Play,Rp15.050,kylafood,Bandung,1rb+ terjual,5.0
2,"Seblak Rafael, Seblak Coet Instan Halal",Rp25.000,Brother Meat Shop,Depok,100+ terjual,4.9
3,Seblak Instan Ceu Nthien Khas Bandung Rasana N...,Rp17.000,Central Seblak Nusantara,Tangerang Selatan,2rb+ terjual,4.9
4,KERUPUK SEBLAK MENTAH ANEKA WARNA,Rp11.500,jajanangarut19,Jakarta Selatan,250+ terjual,4.9


The following are the top 5 data from the seblak product table on Tokopedia.

In [6]:
#syntax to display the bottom 5 data
tokped.tail()


Unnamed: 0,nama_produk,harga_produk,penjual,lokasi_toko,jumlah_terjual,rating_produk
844,Seblak Instan Sehati,Rp13.000,Official Kartika Sari,Bandung,500+ terjual,5.0
845,Bumbu Seblak - Cairo Food,Rp25.500,Cairo Food,Jakarta Pusat,29 terjual,5.0
846,KERUPUK/KRUPUK SEBLAK RASA PEDAS TERLARIS ISI ...,Rp16.695,Toko kue Sumber Mas,Jakarta Timur,250+ terjual,4.9
847,Seblak Merah Pedas / Kerupuk Bawang Oleh Oleh ...,Rp9.000,The Little Snacks,Tangerang Selatan,500+ terjual,4.9
848,SEBLAK JUARA INSTAN MASAK BASAH ASLI BANDUNG E...,Rp22.000,Rasa Juara Indonesia,Kab. Bandung,1rb+ terjual,4.9


The following are the bottom 5 data from the seblak product table on Tokopedia.

In [7]:
# syntax to display the name
tokped.columns

Index(['nama_produk', 'harga_produk', 'penjual', 'lokasi_toko',
       'jumlah_terjual', 'rating_produk'],
      dtype='object')

The table has columns for product name, product price, seller name, shop location, number of products sold, and product rating on the Tokopedia website.

In [8]:
#syntax to display the city location of shops that sell seblak on Tokopedia.
tokped.lokasi_toko.unique()

array(['Kab. Garut', 'Bandung', 'Depok', 'Tangerang Selatan',
       'Jakarta Selatan', 'Jakarta Timur', 'Kab. Bogor', 'Jakarta Barat',
       'Kab. Boyolali', 'Jakarta Pusat', 'Bekasi', 'Kab. Tangerang',
       'Jakarta Utara', 'Kab. Indramayu', 'Surabaya', 'Kab. Bekasi',
       'Kab. Bandung', 'Medan', 'Kab.Ciamis', 'Cimahi', 'Surakarta',
       'Tangerang', 'Tasikmalaya', 'Palembang', 'Kab. Tasikmalaya',
       'Semarang', 'Kab. Sukabumi', 'Banjarbaru', 'Kab. Majalengka',
       'Makassar', 'Kab. Bandung Barat', 'Bogor', 'Kab. Sidoarjo',
       'Malang', 'Kab. Mojokerto', 'Kab. Banyuwangi', 'Kab. Malang',
       'Kab. Brebes', 'Kab. Gresik', 'Kab. Batang', 'Kab. Purwakarta',
       'Kab. Kebumen', 'Kab. Sleman', 'Kab. Karawang', 'Pekanbaru',
       'Sukabumi'], dtype=object)

From the data, it can be observed that seblak sellers on Tokopedia are distributed across various regions.

In [9]:
# syntax to display a summary of the data obtained.
tokped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 849 entries, 0 to 848
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   nama_produk     849 non-null    object 
 1   harga_produk    849 non-null    object 
 2   penjual         849 non-null    object 
 3   lokasi_toko     849 non-null    object 
 4   jumlah_terjual  816 non-null    object 
 5   rating_produk   791 non-null    float64
dtypes: float64(1), object(5)
memory usage: 39.9+ KB


The .info() syntax provides insights into the table's data. It reveals that the table contains 849 rows of data, with missing values present in some columns (store location and product rating). Additionally, the data types in the table are not appropriate for the type of data contained in each column, such as the price of the product and the quantity sold. To address these issues, data cleaning is necessary to ensure the data is properly formatted.

### Data Cleaning

In [10]:
# Syntax to change symbols that are not used in values
tokped['harga_produk'].replace('Rp','', regex=True, inplace=True)
tokped['harga_produk'].replace('\.','', regex=True, inplace=True)
tokped['jumlah_terjual'].replace('\+ terjual','', regex=True, inplace=True)
tokped['jumlah_terjual'].replace('terjual','', regex=True, inplace=True)
tokped['jumlah_terjual'].replace('rb',"000", regex=True, inplace=True)



Symbols must be cleaned because they will interfere with the analysis process because python cannot read symbols such as (.) or (rb) or (Rp) so they are cleaned. The only data needed is the numbers.

In [11]:
#syntax to remove null/non data
tokped = tokped.dropna()

In [12]:
#syntax to change the datatype of a column to match its value. like
#columns for product price and number of products sold are integers.
tokped.harga_produk = tokped.harga_produk.astype('int64')
tokped.jumlah_terjual = tokped.jumlah_terjual.astype('int64')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tokped.harga_produk = tokped.harga_produk.astype('int64')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tokped.jumlah_terjual = tokped.jumlah_terjual.astype('int64')


In [13]:
# syntax to clean duplicate data from each column.
#This syntax will delete rows that exactly match other rows.
tokped = tokped.drop_duplicates()

In [14]:
tokped.info()

<class 'pandas.core.frame.DataFrame'>
Index: 658 entries, 0 to 848
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   nama_produk     658 non-null    object 
 1   harga_produk    658 non-null    int64  
 2   penjual         658 non-null    object 
 3   lokasi_toko     658 non-null    object 
 4   jumlah_terjual  658 non-null    int64  
 5   rating_produk   658 non-null    float64
dtypes: float64(1), int64(2), object(3)
memory usage: 36.0+ KB


From this data, the data type for the quantity sold and product price columns has changed to integer.

In [15]:
#syntax to display whether there is null data
tokped.isnull().any()

nama_produk       False
harga_produk      False
penjual           False
lokasi_toko       False
jumlah_terjual    False
rating_produk     False
dtype: bool

These results explain that in the entire table there are no false/null values. so that the data already has the appropriate value.

In [16]:
# to display the product_price column
tokped[["harga_produk"]]

Unnamed: 0,harga_produk
0,13000
1,15050
2,25000
3,17000
4,11500
...,...
840,2881
841,13000
842,12500
843,16500


From the data cleaning results, the value in the harga_produk column no longer contains the character (Rp).

In [17]:
tokped[["jumlah_terjual"]]


Unnamed: 0,jumlah_terjual
0,250
1,1000
2,100
3,2000
4,250
...,...
840,16
841,60
842,50
843,5


From the data cleaning results, the value in the jumlah_terjual column no longer contains the symbol (+) and the word "sold"

In [32]:
tokped

Unnamed: 0,nama_produk,harga_produk,penjual,lokasi_toko,jumlah_terjual,rating_produk,jab,non
0,Gelifood Seblak Instan Kerupuk Mawar Bumbu Ken...,13000,Lidigeli,Kab. Garut,250,4.8,0,1
1,Kylafood Seblak Original Play,15050,kylafood,Bandung,1000,5.0,0,1
2,"Seblak Rafael, Seblak Coet Instan Halal",25000,Brother Meat Shop,Depok,100,4.9,1,0
3,Seblak Instan Ceu Nthien Khas Bandung Rasana N...,17000,Central Seblak Nusantara,Tangerang Selatan,2000,4.9,0,1
4,KERUPUK SEBLAK MENTAH ANEKA WARNA,11500,jajanangarut19,Jakarta Selatan,250,4.9,1,0
...,...,...,...,...,...,...,...,...
840,Kerupuk Seblak Mawar Kriwil Pedas 60gr ANABA,2881,Siaga Lapak,Malang,16,5.0,0,1
841,Seblak Ceker pedas daun jeruk 250 gram,13000,Markas Makaroni,Kab. Bekasi,60,4.9,1,0
842,CAMILAN SEBLAK PEDAS 250GR SNACK KILOAN JAJAN ...,12500,FARZHA SNACK STORE,Semarang,50,4.6,0,1
843,keripik seblak rafael pedas & asin,16500,keyz shop78,Bandung,5,5.0,0,1


The data that has been cleaned is ready to be analyzed.

## __SMART Framework__

Creating a SMART Framework for the requested case of becoming a dropshipper for seblak on the Tokopedia platform, leveraging a background in data science.

======================

__SMART__

- __Specific__: To generate profit from dropshipping seblak products through the Tokopedia platform.
- __Measurable__: Achieve a profit margin of 10% from each sale.
- __Achievable__: Analyze seblak product data from Tokopedia, including product prices, main seller locations, and ratings. Determine the product selection and selling prices.
- __Relevant__: Gain insights into the market to leverage for product promotion.
- __Time-based__: Achieve results within the next month.

======================

To become a successful seblak dropshipper by earning a 10% commission from each sale, conducting market analysis through the Tokopedia platform, and achieving results within the next month.

## __Analysis__

Performing analysis on the table of seblak product data on Tokopedia:

1. Calculate the mean, median, standard deviation, skewness, and kurtosis for the columns: product price, number of products sold, and product rating.
2. Calculate confidence intervals to determine the minimum and maximum potential revenue when selling seblak products.
3. Conduct hypothesis testing regarding the product prices in Jabodetabek and non-Jabodetabek areas. Is there a difference in pricing based on the seller's location?
4. Analyze the correlation to determine if buyers prefer cheaper products.

### 1. Calculate the average, median, standard deviation, skewness, and kurtosis in the product price, number of products sold, and product rating columns.

In [19]:
#Displays the average, median, standard deviation, skewness and kurtosis values in the product price column.
rata2_harga = tokped.harga_produk.mean()
tengah_harga = tokped.harga_produk.median(axis=0)
std_harga = tokped.harga_produk.std()
skew_harga = tokped.harga_produk.skew()
kurt_harga = tokped.harga_produk.kurt()

print(f'Rata-rata harga dari produk seblak adalah {rata2_harga}')
print(f'Nilai median harga dari produk seblak adalah {tengah_harga}')
print(f'Standar deviasi harga dari produk seblak adalah {std_harga}')
print(f'Nilai skewness harga dari produk seblak adalah {skew_harga}')
print(f'Nilai kurtosis harga dari produk seblak adalah {kurt_harga}')

Rata-rata harga dari produk seblak adalah 22583.281155015196
Nilai median harga dari produk seblak adalah 15000.0
Standar deviasi harga dari produk seblak adalah 24986.786719538606
Nilai skewness harga dari produk seblak adalah 3.407909618686799
Nilai kurtosis harga dari produk seblak adalah 15.792949123435596


__Explanation of Product Price Data__

Based on the data in the product price column, the mean is 22583.281155015196, the median is 15000.0, the standard deviation is 24986.786719538606, the skewness value is 3.407909618686799, and the kurtosis value is 15.792949123435596. From these figures, it can be inferred that the mean is greater than the median, indicating a __left-skewed or positively skewed distribution__. The high standard deviation compared to the mean also indicates that the __data varies significantly__. A skewness value higher than 1 also indicates that the data is __strongly__ left-skewed. A kurtosis value above 0 also indicates that the data is __leptokurtic__, indicating many outliers/extreme data points.

It can be inferred that many seblak products are sold at low prices around Rp15000 (approximately the median price). The mean value is influenced by outliers/prices that are higher than the majority of products. The prices offered by sellers also vary significantly.

In [20]:
# Displays the average, median, standard deviation, skewness and kurtosis values in the number of products sold column
rata2_terjual = tokped.jumlah_terjual.mean()
tengah_terjual = tokped.jumlah_terjual.median(axis=0)
std_terjual = tokped.jumlah_terjual.std()
skew_terjual = tokped.jumlah_terjual.skew()
kurt_terjual = tokped.jumlah_terjual.kurt()

print(f'Rata-rata jumlah produk seblak terjual adalah {rata2_terjual}')
print(f'Nilai median jumlah produk seblak terjual adalah {tengah_terjual}')
print(f'Standar deviasi jumlah produk seblak terjual adalah {std_terjual}')
print(f'Nilai skewness jumlah produk seblak terjual adalah {skew_terjual}')
print(f'Nilai kurtosis jumlah produk seblak terjual adalah {kurt_terjual}')

Rata-rata jumlah produk seblak terjual adalah 362.41185410334344
Nilai median jumlah produk seblak terjual adalah 75.0
Standar deviasi jumlah produk seblak terjual adalah 1089.5167469286384
Nilai skewness jumlah produk seblak terjual adalah 6.357000334654058
Nilai kurtosis jumlah produk seblak terjual adalah 47.401601968584494


__Explanation of Sold Product Data__

Based on the data in the column of sold product quantity, the mean is 362.41185410334344, the median is 75.0, and the standard deviation is 1089.5167469286384. It can be inferred that the mean value greater than the median suggests a __left-skewed or positively skewed distribution__. The high standard deviation, which is also higher than the mean, indicates that the __data varies significantly__. A high skewness value suggests that the data is __strongly__ skewed. A kurtosis value above 0 also indicates that the data is __leptokurtic__, suggesting many outliers/extreme data points.

Based on this explanation, most products are sold in quantities around 75 pieces (approximately the median value). The mean value is influenced by outliers from products that have been sold in quantities far exceeding the majority of seblak products on Tokopedia. The quantity of sold items for each product varies significantly.

In [21]:
#Displays the average, median, standard deviation, skewness and kurtosis values in the product rating column.
rata2_rating = tokped.rating_produk.mean()
tengah_rating = tokped.rating_produk.median(axis=0)
std_rating = tokped.rating_produk.std()
skew_rating = tokped.rating_produk.skew()
kurt_rating = tokped.rating_produk.kurt()

print(f'Rata-rata rating produk seblak adalah {rata2_rating}')
print(f'Nilai median rating produk seblak adalah {tengah_rating}')
print(f'Standar deviasi rating produk seblak adalah {std_rating}')
print(f'Nilai skewness rating produk seblak adalah {skew_rating}')
print(f'Nilai kurtosis rating produk seblak adalah {kurt_rating}')

Rata-rata rating produk seblak adalah 4.87127659574468
Nilai median rating produk seblak adalah 4.9
Standar deviasi rating produk seblak adalah 0.20124092824718232
Nilai skewness rating produk seblak adalah -5.951229714629269
Nilai kurtosis rating produk seblak adalah 66.91180332035384


__Explanation of Product Rating Data__

Based on the data in the product rating column, the mean is 4.87127659574468, the median is 4.9, and the standard deviation is 0.20124092824718232, with a skewness value of -5.951229714629269, and a kurtosis value of 66.91180332035384. With the mean value smaller than the median, the rating data is suspected to be __right-skewed or negatively skewed__. The low standard deviation below the mean suggests that the __data is not highly variable__. The skewness value below -1 indicates that the data is __strongly skewed to the right__. A kurtosis value above 0 also suggests that the data is __leptokurtic__, indicating many outliers/extreme data points. Although the difference between the mean and median is minimal (a normal distribution if the mean and median are equal), the high skewness and kurtosis values suggest otherwise.

Based on this explanation, most products receive a rating of 4.9 (based on the median value). The lower mean compared to the median, unlike other data, is due to many outliers/very low rating values that affect the mean. However, the data's distribution is not highly variable due to the low standard deviation.

### 2. Calculate the confidence interval to get the minimum and maximum potential income if you sell seblak products?

I will calculate the maximum and minimum earnings I can obtain from each purchase and my profit within 1 month. (The profit I take from each transaction is 10% of the product price.)

__For Your Information__

The confidence interval consists of a lower limit and an upper limit. The lower limit is the lowest value within the 95% overall data average, and conversely, the upper limit is the highest value within the 95% overall data average.

In [22]:
#To get the confidence interval of the product price.
harga = tokped['harga_produk']
std = harga.std()
N = len(tokped)
low, up = stats.norm.interval(0.95,loc=harga.mean(),scale=std/np.sqrt(N))
print('Lower Limit:',low)
print('Upper Limit:',up)

Lower Limit: 20674.106967311465
Upper Limit: 24492.455342718928


The insight obtained from these values is that from the product price data, 95% of the products have a lowest average price of 20,674.106967311465 and have a highest average price of 24,492.455342718928.

In [23]:
#To get the confidence interval of the product price.
jumlah = tokped['jumlah_terjual']
std = jumlah.std()
N = len(tokped)
low, up = stats.norm.interval(0.95,loc=jumlah.mean(),scale=std/np.sqrt(N))
print('Lower Limit:',low)
print('Upper Limit:',up)

Lower Limit: 279.16476540583744
Upper Limit: 445.65894280084945


The insight obtained from this value is from price data on the number of products sold, 95% of products have the lowest average number of selling goods of 279.16476540583744 and have the highest average number of selling goods of 445.65894280084945.

In [24]:
#Syntax to get confidence from product seller income (product price*number sold)
pemasukan = tokped['harga_produk'] * tokped['jumlah_terjual']
std = pemasukan.std()
N = len(tokped)
low, up = stats.norm.interval(0.95,loc=pemasukan.mean(),scale=std/np.sqrt(N))
print('Lower Limit:',low)
print('Upper Limit:',up)

Lower Limit: 4500113.590434211
Upper Limit: 7966760.829018676


The insight obtained from this value is that 95% of sellers have the lowest average income of 4500113.590434211 and have the highest average of 7966760.829018676.

In [25]:
#Syntax to gain confidence from dropshipper profits with 10% commission
pemasukan = tokped['harga_produk'] * tokped['jumlah_terjual'] *0.1
std = pemasukan.std()
N = len(tokped)
low, up = stats.norm.interval(0.95,loc=pemasukan.mean(),scale=std/np.sqrt(N))
print('Lower Limit:',low)
print('Upper Limit:',up)

Lower Limit: 450011.3590434213
Upper Limit: 796676.0829018676


The insight obtained from these values is that if we, as dropshippers, take a 10% profit from the product price, the lowest average income is 450,011.3590434213 and the highest average income is 796,676.0829018676.

In [26]:
#Syntax to get confidence from gross income
pemasukan = tokped['harga_produk'] * tokped['jumlah_terjual'] *1.1
std = pemasukan.std()
N = len(tokped)
low, up = stats.norm.interval(0.95,loc=pemasukan.mean(),scale=std/np.sqrt(N))
print('Lower Limit:',low)
print('Upper Limit:',up)

Lower Limit: 4950124.9494776325
Upper Limit: 8763436.911920544


The insight obtained from these values is that if we, as dropshippers, sell items with a price that already includes a 10% profit, the lowest average gross income is 4,950,124.9494776325 and the highest average gross income is 8,763,436.911920544.

__Response:__

As a dropshipper looking to start selling seblak with a 10% commission from each sale, the minimum net profit that can be obtained in one month is Rp450,011, while the maximum profit is Rp796,676. This can be achieved if I, as a dropshipper, can sell an average minimum of 279 products and an average maximum of 445 products. However, to remain competitive with other sellers, it's essential to set prices within the range of the minimum average price of Rp20,674 and the maximum average price of Rp24,492.



### 3. Conduct hypothesis testing regarding the prices of goods in Jabodetabek and non-Jabodetabek. Is there a difference in the location of the seller's area for the prices used?

In [27]:
#To find out the location of the city selling seblak
lokasi = tokped.lokasi_toko.unique()
lokasi

array(['Kab. Garut', 'Bandung', 'Depok', 'Tangerang Selatan',
       'Jakarta Selatan', 'Jakarta Timur', 'Kab. Bogor', 'Jakarta Barat',
       'Kab. Boyolali', 'Jakarta Pusat', 'Bekasi', 'Kab. Tangerang',
       'Jakarta Utara', 'Kab. Indramayu', 'Surabaya', 'Kab. Bekasi',
       'Kab. Bandung', 'Medan', 'Kab.Ciamis', 'Cimahi', 'Surakarta',
       'Tangerang', 'Tasikmalaya', 'Palembang', 'Kab. Tasikmalaya',
       'Semarang', 'Kab. Sukabumi', 'Banjarbaru', 'Kab. Majalengka',
       'Makassar', 'Kab. Bandung Barat', 'Bogor', 'Kab. Sidoarjo',
       'Malang', 'Kab. Mojokerto', 'Kab. Malang', 'Kab. Brebes',
       'Kab. Gresik', 'Kab. Batang', 'Kab. Purwakarta', 'Kab. Kebumen',
       'Kab. Sleman', 'Pekanbaru', 'Sukabumi'], dtype=object)

In [28]:
# Syntax for grouping Jabodetabek and non-Jabodetabek areas.
jab = []
non = []
jabodetabek = tokped['lokasi_toko'].str.contains('Jakarta|Bogor|Depok|Tanggerang|Bekasi', regex=True)
for korwil in jabodetabek:
    if korwil == True:
        jab.append(1)
        non.append(0)

    else:
        jab.append(0)
        non.append(1)


tokped['jab'] = jab
tokped['non'] = non

__Hypothesis__
H0: Jabodetabek average = non-Jabodetabek average (Jabodetabek average is the same as non-Jabodetabek average)

H1: Jabodetabek average != non-Jabodetabek average ((Jabodetabek average is not the same as non-Jabodetabek average))

In [29]:
# syntax for testing 2 related samples
t_stat,p_val = stats.ttest_rel(tokped.jab,tokped.non) 
print('P-value:',p_val)

P-value: 8.975691311711243e-05


__Answer__  
Based on the p-value below 0.05 (taking a 95% confidence level), which is 8.97 x 10^-5, the alternative hypothesis (H1) suggests that the average price of seblak products in Jabodetabek is different from the price of seblak outside Jabodetabek.

These prices are influenced by various factors such as raw materials, production volumes of each seblak product, and target markets. It is suspected that prices in Jabodetabek are higher than those outside Jabodetabek due to lower raw material costs compared to those in Jabodetabek.

### 4. Analyze the correlation of whether buyers prefer products that are cheap?

In [31]:
#Syntax for analyzing correlations with the Pearson technique
corr_r, pval_p = stats.pearsonr(tokped['harga_produk'], tokped['jumlah_terjual'])


print(f"r-correlation: {corr_r:.2f}, p-value: {pval_p}")
#Spearson technique is used because there are 2 quantitative data and the data is considered to have a normal distribution.

r-correlation: -0.07, p-value: 0.06576496747150096


__Answer__

From the data, it can be concluded that there is a negative correlation, as indicated by the correlation coefficient of -0.07 and a p-value above 0.05. This implies an influence between the price of the product and the quantity sold.

The negative correlation suggests that as the price increases, the quantity sold tends to decrease (as the correlation line slopes downwards to the right), and conversely, lower prices tend to result in higher sales volume.

## __Conclusion__

Based on various analyses conducted, as a dropshipper, I can sell seblak within the price range of Rp20,674 to Rp24,492, which includes a 10% commission. To achieve a profit of Rp450,011 to Rp796,676 within one month, I need to sell between 279 to 445 units of the product. The pricing strategy is crucial as it directly impacts the potential profit and can attract more customers due to the preference for affordable products. However, to lower costs and increase profits, sourcing products from outside Jabodetabek, where raw materials are cheaper, can be beneficial. Choosing high-quality products from reputable sellers is essential as it influences the product rating. However, ratings tend to be consistent, making it relatively easier to maintain a good rating as long as the quality remains consistent.

For instance, by selling "SEBLAK JUARA INSTAN MASAK BASAH ASLI BANDUNG" from Rasa Juara Indonesia at a price of Rp24,000 (with a 10% commission from the original price), I could potentially earn a profit of Rp890,000 within one month by selling 445 units.