# TUNISIANET PC DATA SCRAPING

### Importing libraries

The libraries needed:

- Requests
- BeautifulSoup
- Pandas

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### Creating the main columns as arrays

In [2]:
products=[] 
prices=[] 
availability=[]

### Scraping the data from the Tunisianet website

To scrape all the PC data from the website, we have to visit every page of the listing. A for statement changes the page number in the URL on each pass, so the loop below requests pages 1 through 28. Inside it, a second for statement walks over the product cards on the current page and picks out the name, price, and availability elements.
We then append the text of those elements to the matching arrays.

In [3]:
for i in range(1, 29):
    URL = "https://www.tunisianet.com.tn/702-ordinateur-portable?page=" + str(i) + "&order=product.price.desc"
    page = requests.get(URL)

    soup = BeautifulSoup(page.content, "html.parser")

    results = soup.find(id="js-product-list")

    product_elements = results.find_all("div", class_="item-product col-xs-12")

    for product_element in product_elements:
        name_element = product_element.find("h2", class_="h3 product-title")
        price_element = product_element.find("span", class_="price")
        avail_element = product_element.find("div", id="stock_availability")
        products.append(name_element.text.strip())
        prices.append(price_element.text.strip())
        availability.append(avail_element.text.strip())
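
The loop above assumes every `find` call returns an element. If a product card lacks a price or availability block, `find` returns `None` and `.text` raises an `AttributeError`. Below is a minimal sketch of two helpers that would make the loop more forgiving. The names `BASE_URL`, `build_page_url`, and `safe_text` are hypothetical and not part of the original notebook.

```python
# Hypothetical helpers; the URL pieces are copied from the loop above.
BASE_URL = "https://www.tunisianet.com.tn/702-ordinateur-portable"


def build_page_url(page_number):
    """Build the listing URL for one page, sorted by descending price."""
    return f"{BASE_URL}?page={page_number}&order=product.price.desc"


def safe_text(element):
    """Return the stripped text of a parsed element, or "" if it is None."""
    if element is None:
        return ""
    return element.text.strip()
```

Inside the loop, `products.append(safe_text(name_element))` would then record an empty string for a missing element instead of stopping the scrape.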

### Creating the data frame

In [4]:
df = pd.DataFrame({'Product Name':products,'Price':prices, 'Availability':availability}) 

# Data Cleaning Process

### Removing "PC Portable" from the name

I noticed that "PC Portable" is repeated in most of the product names except for gaming laptops, so I decided to remove it from the name column, since it adds no information.

In [5]:
df["Product Name"] = df["Product Name"].str.replace("(?i)PC Portable","", regex=True)

### Removing extra spaces

In [6]:
df["Product Name"] = df["Product Name"].str.strip()

### Changing the "," to "." in the price column

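For future calculations, the price needs "." as its decimal separator, because Python and Pandas only parse decimal numbers written that way. Note that the column still holds strings after this step: the space between thousands and the " DT" suffix remain.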

We check what our table looks like with .head

In [7]:
df["Price"] = df["Price"].str.replace(",",".")
df.head(5)

Unnamed: 0,Product Name,Price,Availability
0,Station de travail Mobile Dell Precision 5560 ...,11 179.000 DT,Sur commande
1,Lenovo ThinkPad X1 Extreme / i7-11800H / RTX 3...,10 419.000 DT,Sur commande
2,MSI CreatorPro M16 / i7 12é Gen / RTX A3000 12...,9 975.000 DT,Sur commande
3,MSI Gaming GF63 Thin 11SC / i7 11è Gén / 16 Go...,9 514.000 DT,En stock
4,ASUS ROG ZEPHYUS DUO 16 GX650RW-LS052W / Ryzen...,9 399.000 DT,Sur commande
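
As the output shows, the prices are still text such as `11 179.000 DT`, so they cannot be summed or averaged yet. Here is a minimal sketch of one way to turn them into numbers, assuming the three digits after the separator are decimals, as the replacement step above implies. The function name `parse_price` is hypothetical.

```python
import re


def parse_price(text):
    """Convert a price string such as '11 179,000 DT' to a float."""
    # Keep only digits and separators; this drops ' DT' and any kind of space.
    digits = re.sub(r"[^\d,.]", "", text)
    # Treat the comma as the decimal separator, as the website does.
    digits = digits.replace(",", ".")
    # An empty string (no digits found) becomes NaN rather than raising.
    return float(digits) if digits else float("nan")
```

Applied to the column, this could look like `df["Price"] = df["Price"].map(parse_price)`, which works whether or not the comma has already been replaced.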


### Separating the name column

The name column included all of the information about the laptop, such as the size and brand name, so I'm going to attempt to separate that info into different columns.

The most common brands are 'ACER', 'APPLE', 'ASUS', 'DELL', 'HP', 'HUAWEI', 'INFINIX', 'LENOVO', and 'MSI' so we will be extracting those.

The most common RAM sizes are ' 2 GO', ' 4 GO', ' 6 GO', ' 8 GO', ' 12 GO', ' 14 GO', ' 16 GO', ' 18 GO', ' 20 GO', ' 24 GO', ' 26 GO',' 28 GO', ' 30 GO', ' 32 GO', ' 34 GO', and ' 36 GO'

It is important that the format used in the array that includes all the data matches that of the website. For example, the Tunisianet website specifically refers to sizes as "Size Go", so working with "Size" or "Size G" would not work.

I also put all of the letters in uppercase to make sure everything is in the same format.

In [8]:
brandsa = ['ACER', 'APPLE', 'ASUS', 'DELL', 'HP', 'HUAWEI', 'INFINIX', 'LENOVO', 'MSI']
brands = []
rama = [' 2 GO', ' 4 GO', ' 6 GO', ' 8 GO', ' 12 GO', ' 14 GO', ' 16 GO', ' 18 GO', ' 20 GO', ' 24 GO', ' 26 GO',' 28 GO', ' 30 GO', ' 32 GO', ' 34 GO', ' 36 GO']
ram = []
check = False
products = df["Product Name"].values.tolist() ## Turn into an array

for i in range(len(products)):
    name = products[i].upper()

    # Brand: keep the first match only, so each product adds exactly one entry
    for j in range(len(brandsa)):
        if brandsa[j] in name:
            brands.append(brandsa[j])
            check = True
            break
    if not check:
        brands.append("")
    check = False

    # RAM: same rule, first match only
    for j in range(len(rama)):
        if rama[j] in name:
            ram.append(rama[j])
            check = True
            break
    if not check:
        ram.append("")
    check = False

# Putting the brands and ram arrays into the df
df["Brands"] = brands
df["RAM"] = ram
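
Plain substring checks can match inside longer words, and the RAM list has to name every size in advance. As an alternative, here is a minimal sketch using regular expressions with word boundaries. The names `BRANDS`, `RAM_PATTERN`, `extract_brand`, and `extract_ram` are hypothetical, and the rule that RAM is a one- or two-digit number followed by "Go" is an assumption based on the list above.

```python
import re

BRANDS = ['ACER', 'APPLE', 'ASUS', 'DELL', 'HP', 'HUAWEI', 'INFINIX', 'LENOVO', 'MSI']

# One or two digits followed by "Go"; the lookbehind rejects sizes such as
# "512 Go", which are assumed to be storage rather than RAM.
RAM_PATTERN = re.compile(r"(?<!\d)(\d{1,2})\s*GO\b", re.IGNORECASE)


def extract_brand(name, brands=BRANDS):
    """Return the first brand that appears as a whole word, or ""."""
    upper = name.upper()
    for brand in brands:
        if re.search(rf"\b{re.escape(brand)}\b", upper):
            return brand
    return ""


def extract_ram(name):
    """Return the first RAM size found, formatted like '16 GO', or ""."""
    match = RAM_PATTERN.search(name)
    return f"{match.group(1)} GO" if match else ""
```

With these, the two columns could be filled with `df["Product Name"].map(extract_brand)` and `df["Product Name"].map(extract_ram)`, which always yield exactly one value per row.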

### Removing the brands and RAM sizes from the name, as well as any extra characters

In [9]:
for i in range(len(brandsa)):
    df["Product Name"] = df["Product Name"].str.replace(brandsa[i],"", regex=True, case=False)
for i in range(len(rama)):
    df["Product Name"] = df["Product Name"].str.replace(rama[i],"", regex=True, case=False)
df["Product Name"] = df["Product Name"].str.replace("/ /","/", regex=True, case=False)
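
Replacing "/ /" with "/" handles an empty segment in the middle of a name, but a separator left at the very start or end stays in place. Here is a minimal sketch of a more general cleanup that splits on "/" and drops empty segments. The function name `tidy_name` is hypothetical.

```python
def tidy_name(name):
    """Drop empty '/'-separated segments and normalise spacing around '/'."""
    parts = [part.strip() for part in name.split("/")]
    return " / ".join(part for part in parts if part)
```

It could be applied with `df["Product Name"] = df["Product Name"].map(tidy_name)`. One side effect to keep in mind: a "/" written without spaces inside a name also gets spaces around it.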

## Final Result

In [10]:
df.head(5)

Unnamed: 0,Product Name,Price,Availability,Brands,RAM
0,Station de travail Mobile Precision 5560 Tact...,11 179.000 DT,Sur commande,DELL,32 GO
1,ThinkPad X1 Extreme / i7-11800H / RTX 3060 6G...,10 419.000 DT,Sur commande,LENOVO,32 GO
2,CreatorPro M16 / i7 12é Gen / RTX A3000 12G /,9 975.000 DT,Sur commande,MSI,32 GO
3,Gaming GF63 Thin 11SC / i7 11è Gén / GTX 1650 4G,9 514.000 DT,En stock,MSI,16 GO
4,ROG ZEPHYUS DUO 16 GX650RW-LS052W / Ryzen R7 ...,9 399.000 DT,Sur commande,ASUS,


### Exporting as a CSV file

In [11]:
df.to_csv('Tunisianet Computers.csv', index=False, encoding='utf-8')