# Data Wrangling with Python - Exception Handling Example

#### Import Libraries

In [14]:
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np

Let's assume you reran the same web scraping code we built earlier after a few weeks. This time, let's use the <b>cosmic-pc-store-updated.html</b> file.

In [15]:
with open('cosmic-pc-store-updated.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

In [16]:
laptops_list = soup.find_all('center', attrs={'class':'iteminfo'})

In [17]:
scraped_data = []
for laptop in laptops_list:
    
    brandname = laptop.find('div', class_='brandname').text.strip()
    laptopname = laptop.find('div', class_='laptopname').text.strip()
    rating= laptop.find('div', class_='rating').text.strip()
    age = laptop.find('div', class_='age').text.strip()
    price = laptop.find('p', class_='listprice').text.strip()
    disc = laptop.find('a', class_='discount').text.strip()
    
    # category, ram, processor, display, storage, weight = specs
    specs = [x.text for x in laptop.find('div', class_='collapse').select('b')]

    # let's append the laptop's data to the list
    scraped_data.append([brandname, laptopname, rating, age, price, disc] + specs)

AttributeError: 'NoneType' object has no attribute 'text'

#### Let's investigate

If you open the webpage and scroll through, you will see that <b>age</b> and <b>rating</b> are missing for many laptops. This means that when you search for class = 'age' in such a laptop's block, find() returns None, and accessing .text on None raises the <b>AttributeError</b>: 'NoneType' object has no attribute 'text'.

The same can happen with other columns. As data scientists and engineers, this is something we need to be prepared for.
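
To see the failure in isolation, here is a minimal sketch on a made-up HTML snippet. The markup below is an assumption for illustration, not copied from the store page. It shows that a search with no match returns None, and that reading .text from None raises the error.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one laptop block that has a rating but no age
html = """
<center class="iteminfo">
  <div class="rating">Rating: 4/5</div>
</center>
"""
item = BeautifulSoup(html, 'html.parser').find('center', class_='iteminfo')

rating_tag = item.find('div', class_='rating')
age_tag = item.find('div', class_='age')  # no such div, so find() returns None

try:
    age_tag.text
    error_message = None
except AttributeError as err:
    # Keep the message so we can inspect it
    error_message = str(err)

print(rating_tag.text.strip())
print(age_tag)
print(error_message)
```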

#### Let's try fixing this using exception handling statements
We shall fix this by storing age as a <b>NaN</b> value if it does not exist for the laptop. The same can be done for the rating.

In [18]:
scraped_data = []
for laptop in laptops_list:
    
    brandname = laptop.find('div', class_='brandname').text.strip()
    laptopname = laptop.find('div', class_='laptopname').text.strip()
    
    try: rating = laptop.find('div', class_='rating').text.strip()
    except AttributeError: rating = np.nan
    
    try: age = laptop.find('div', class_='age').text.strip()
    except AttributeError: age = np.nan
    
    price = laptop.find('p', class_='listprice').text.strip()
    disc = laptop.find('a', class_='discount').text.strip()
    
    # category, ram, processor, display, storage, weight = specs
    specs = [x.text for x in laptop.find('div', class_='collapse').select('b')]

    # let's append the laptop's data to the list
    scraped_data.append([brandname, laptopname, rating, age, price, disc] + specs)

In [19]:
column_names = ['brand_name','laptop_name','rating','age','price','disc','category','ram','processor',
                'display','storage','weight']

In [20]:
df = pd.DataFrame(data=scraped_data, columns=column_names)

In [21]:
df.head(10)

,brand_name,laptop_name,rating,age,price,disc,category,ram,processor,display,storage,weight
0,Apple,MacBook Pro,Rating: 4/5,,$994,13%,Ultrabook,8GB,Intel Core i5 2.3GHz,IPS Panel Retina Display 2560x1600,128GB SSD,1.37kg
1,Apple,Macbook Air,Rating: 5/5,Age: 8 years,$929,5%,Ultrabook,8GB,Intel Core i5 1.8GHz,128GB Flash Storage,1.34kg,
2,HP,250 G6,Rating: 4/5,Age: 8 years,$507,8%,Notebook,Intel Core i5 7200U 2.5GHz,Full HD 1920x1080,256GB SSD,1.86kg,
3,Apple,MacBook Pro,Rating: 5/5,,$566,12%,Ultrabook,16GB,Intel Core i7 2.7GHz,IPS Panel Retina Display 2880x1800,512GB SSD,1.83kg
4,Apple,MacBook Pro,Rating: 3.5/5,Age: 4 years,$529,5%,8GB,Intel Core i5 3.1GHz,IPS Panel Retina Display 2560x1600,256GB SSD,1.37kg,
5,Acer,Aspire 3,Rating: 5/5,Age: 7 years,$580,15%,Notebook,4GB,AMD A9-Series 9420 3GHz,500GB HDD,2.1kg,
6,Apple,MacBook Pro,Rating: 5/5,Age: 8 years,$892,5%,Ultrabook,16GB,Intel Core i7 2.2GHz,IPS Panel Retina Display 2880x1800,256GB Flash Storage,2.04kg
7,Apple,Macbook Air,,Age: 4 years,$796,9%,Ultrabook,8GB,Intel Core i5 1.8GHz,1440x900,256GB Flash Storage,1.34kg
8,Asus,ZenBook UX430UN,Rating: 4.5/5,Age: 5 years,$686,4%,Ultrabook,Intel Core i7 8550U 1.8GHz,Full HD 1920x1080,512GB SSD,1.3kg,
9,Acer,Swift 3,,,$358,10%,8GB,Intel Core i5 8250U 1.6GHz,IPS Panel Full HD 1920x1080,256GB SSD,1.6kg,


### Now we have discovered another problem!

Take a look at <font color='red'>row 4</font>. Some specifications are missing, and this has put values in the wrong columns. Because the category was missing, RAM was the first specification found, so category was filled with 8GB, which is incorrect. The shift carries <font color='red'>all the way to the end</font> of the row and leaves weight empty. Rows 1, 2, 5, 8, and 9 show the same kind of shift.
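
The shift is easy to reproduce without any HTML. The sketch below uses the five specification values that row 4 shows in the output above and pairs them with the column names by position, which is effectively what happens when the specs are collected as one list.

```python
spec_columns = ['category', 'ram', 'processor', 'display', 'storage', 'weight']

# Values found for row 4: the category tag is missing, so only five values come back
row_4_specs = ['8GB', 'Intel Core i5 3.1GHz',
               'IPS Panel Retina Display 2560x1600', '256GB SSD', '1.37kg']

# Pairing by position silently assigns each value to the wrong column
by_position = dict(zip(spec_columns, row_4_specs))

print(by_position['category'])  # 8GB
print(by_position['storage'])   # 1.37kg
print('weight' in by_position)  # False
```

Nothing raises an error here, which is why this kind of problem is easy to miss.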

#### We can fix this by writing better code
When performing web scraping, always look for exactly what you need. Don't be ambiguous in your search: select each field by its own class instead of relying on its position.

In [22]:
scraped_data = []
for laptop in laptops_list:
    
    brandname = laptop.find('div', class_='brandname').text.strip()
    laptopname = laptop.find('div', class_='laptopname').text.strip()
    
    try: rating = laptop.find('div', class_='rating').text.strip()
    except AttributeError: rating = np.nan
    
    try: age = laptop.find('div', class_='age').text.strip()
    except AttributeError: age = np.nan
    
    price = laptop.find('p', class_='listprice').text.strip()
    disc = laptop.find('a', class_='discount').text.strip()
    
    # category, ram, processor, display, storage, weight = specs
    try: category = laptop.find('b',class_='category').text.strip()
    except AttributeError: category = np.nan

    try: ram = laptop.find('b',class_='ram').text.strip()
    except AttributeError: ram = np.nan

    try: processor = laptop.find('b',class_='processor').text.strip()
    except AttributeError: processor = np.nan

    try: display = laptop.find('b',class_='display').text.strip()
    except AttributeError: display = np.nan

    try: storage = laptop.find('b',class_='storage').text.strip()
    except AttributeError: storage = np.nan

    try: weight = laptop.find('b',class_='weight').text.strip()
    except AttributeError: weight = np.nan

    # let's append the laptop's data to the list
    scraped_data.append([brandname, laptopname, rating, age, price, disc] + 
                        [category, ram, processor, display, storage, weight])

In [23]:
df = pd.DataFrame(data=scraped_data, columns=column_names)

The issue is now fixed: each specification lands in its own column, and a missing value is left empty instead of shifting the rest of the row. So, when writing web scraping code, always try to think ahead and prepare for changes that may occur on the webpage.
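
The repeated try/except pairs can also be folded into a small helper. The sketch below is one possible way to do it, not the only one: text_or_nan is a hypothetical name, and the HTML snippet is made up for illustration rather than copied from the store page. It checks for None directly instead of catching the AttributeError.

```python
import numpy as np
from bs4 import BeautifulSoup

def text_or_nan(parent, tag_name, class_name):
    """Return the stripped text of the first matching tag, or NaN if it is absent."""
    tag = parent.find(tag_name, class_=class_name)
    return tag.text.strip() if tag is not None else np.nan

# Hypothetical laptop block: RAM and weight are present, the other specs are missing
html = """
<center class="iteminfo">
  <div class="collapse">
    <b class="ram">8GB</b>
    <b class="weight">1.37kg</b>
  </div>
</center>
"""
laptop = BeautifulSoup(html, 'html.parser').find('center', class_='iteminfo')

# Look up every specification by its own class name
spec_names = ['category', 'ram', 'processor', 'display', 'storage', 'weight']
specs = {name: text_or_nan(laptop, 'b', name) for name in spec_names}
print(specs)
```

Each field is looked up by name, so a missing tag only affects its own column.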

In [24]:
df.head(10)

,brand_name,laptop_name,rating,age,price,disc,category,ram,processor,display,storage,weight
0,Apple,MacBook Pro,Rating: 4/5,,$994,13%,Ultrabook,8GB,Intel Core i5 2.3GHz,IPS Panel Retina Display 2560x1600,128GB SSD,1.37kg
1,Apple,Macbook Air,Rating: 5/5,Age: 8 years,$929,5%,Ultrabook,8GB,Intel Core i5 1.8GHz,,128GB Flash Storage,1.34kg
2,HP,250 G6,Rating: 4/5,Age: 8 years,$507,8%,Notebook,,Intel Core i5 7200U 2.5GHz,Full HD 1920x1080,256GB SSD,1.86kg
3,Apple,MacBook Pro,Rating: 5/5,,$566,12%,Ultrabook,16GB,Intel Core i7 2.7GHz,IPS Panel Retina Display 2880x1800,512GB SSD,1.83kg
4,Apple,MacBook Pro,Rating: 3.5/5,Age: 4 years,$529,5%,,8GB,Intel Core i5 3.1GHz,IPS Panel Retina Display 2560x1600,256GB SSD,1.37kg
5,Acer,Aspire 3,Rating: 5/5,Age: 7 years,$580,15%,Notebook,4GB,AMD A9-Series 9420 3GHz,,500GB HDD,2.1kg
6,Apple,MacBook Pro,Rating: 5/5,Age: 8 years,$892,5%,Ultrabook,16GB,Intel Core i7 2.2GHz,IPS Panel Retina Display 2880x1800,256GB Flash Storage,2.04kg
7,Apple,Macbook Air,,Age: 4 years,$796,9%,Ultrabook,8GB,Intel Core i5 1.8GHz,1440x900,256GB Flash Storage,1.34kg
8,Asus,ZenBook UX430UN,Rating: 4.5/5,Age: 5 years,$686,4%,Ultrabook,,Intel Core i7 8550U 1.8GHz,Full HD 1920x1080,512GB SSD,1.3kg
9,Acer,Swift 3,,,$358,10%,,8GB,Intel Core i5 8250U 1.6GHz,IPS Panel Full HD 1920x1080,256GB SSD,1.6kg
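
Once missing values are stored as NaN, a quick count per column shows how much data the updated page is lacking. The sketch below is a small illustration that rebuilds three columns of rows 0, 4, and 9 from the output above by hand; on the full data you would call the same method on df.

```python
import numpy as np
import pandas as pd

# Rows 0, 4 and 9 from the output above, restricted to three columns
sample = pd.DataFrame(
    [['Rating: 4/5', np.nan, 'Ultrabook'],
     ['Rating: 3.5/5', 'Age: 4 years', np.nan],
     [np.nan, np.nan, np.nan]],
    columns=['rating', 'age', 'category'],
)

# Number of missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)
```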


--------------