# WEB SCRAPING WITH PANDAS

Web scraping with pandas is simple but limited.

We can only extract tables from a page, specifically HTML tables (`<table>` elements).

In [8]:
# import the pandas library as pd
import pandas as pd
# import the regular expression module
import re

In [11]:
# read every HTML table on the page into a list of DataFrames
flower = pd.read_html('https://en.wikipedia.org/wiki/List_of_plants_with_symbolism')

In [13]:
# check our output
flower

[    0                                                  1
 0 NaN  This article includes a list of general refere...,
                    0                                                  1  \
 0              Plant                                            Meaning   
 1  Asparagus foliage                                        Fascination   
 2             Bamboo                     Longevity, strength, and grace   
 3       Green willow                                         False love   
 4          Mistletoe  Used to signify a meeting place where no viole...   
 5  Maple Tree/leaves             balance, love, longevity and abundance   
 
                    2  
 0  Region or culture  
 1             Europe  
 2              China  
 3            Britain  
 4             Druids  
 5            Various  ,
                                                      0  \
 0    Flower Meaning Acacia rose or white Secret Lov...   
 1                                               Flower   
 2 

In [14]:
# view all the tables together with their index, to know the exact table to extract
for idx,val in enumerate(flower):
    print('*' * 100)
    print('====>',idx)
    print(val)

****************************************************************************************************
====> 0
    0                                                  1
0 NaN  This article includes a list of general refere...
****************************************************************************************************
====> 1
                   0                                                  1  \
0              Plant                                            Meaning   
1  Asparagus foliage                                        Fascination   
2             Bamboo                     Longevity, strength, and grace   
3       Green willow                                         False love   
4          Mistletoe  Used to signify a meeting place where no viole...   
5  Maple Tree/leaves             balance, love, longevity and abundance   

                   2  
0  Region or culture  
1             Europe  
2              China  
3            Britain  
4             Druids  
5   

In [15]:
# there are six tables
len(flower)

6
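
Printing every table in full gets noisy on pages with large tables. As a lighter alternative, the sketch below returns one summary per table: its index, its shape, and its first row. The `summarize_tables` helper and the two small stand-in tables are made up for illustration; in this notebook the `flower` list would be passed instead.

```python
import pandas as pd


def summarize_tables(tables):
    """Return one (index, shape, first_row) tuple per DataFrame in `tables`."""
    summary = []
    for idx, table in enumerate(tables):
        first_row = table.iloc[0].tolist() if len(table) else []
        summary.append((idx, table.shape, first_row))
    return summary


# Stand-in tables (hypothetical data shaped like the read_html output above)
tables = [
    pd.DataFrame([[None, "This article includes a list of general references"]]),
    pd.DataFrame([["Plant", "Meaning", "Region or culture"],
                  ["Bamboo", "Longevity, strength, and grace", "China"]]),
]

for idx, shape, first_row in summarize_tables(tables):
    print(idx, shape, first_row)
```

The first row is often enough to tell a notice box from a real data table, and it also shows whether the header ended up inside the data.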

In [16]:
# we won't be using the table at index 0 (it only holds an article notice)
flower[0]

Unnamed: 0,0,1
0,,This article includes a list of general refere...


In [17]:
# the table at index 1 lists plants in general, not specifically flowers,
# so we store it under a clearer name (a copy, so flower[1] stays unchanged)
plant = flower[1].copy()
plant

Unnamed: 0,0,1,2
0,Plant,Meaning,Region or culture
1,Asparagus foliage,Fascination,Europe
2,Bamboo,"Longevity, strength, and grace",China
3,Green willow,False love,Britain
4,Mistletoe,Used to signify a meeting place where no viole...,Druids
5,Maple Tree/leaves,"balance, love, longevity and abundance",Various


We will use the __first row__ (index 0) as our table header

### We have extracted our table, so let's clean it

In the table above, the column headers are named 0, 1, 2, while the correct column names sit in the first row (index 0)

#### We use __iloc__ to extract just that first row as a Series

In [18]:
plant.iloc[0]

0                Plant
1              Meaning
2    Region or culture
Name: 0, dtype: object

#### We then extract the values from the Series as a NumPy array

In [19]:
type(plant.iloc[0].values)

numpy.ndarray

#### Convert the array to a list

In [20]:
plant.iloc[0].values.tolist()

['Plant', 'Meaning', 'Region or culture']
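
The intermediate NumPy array is not strictly required: `Series.tolist()` returns the same list directly. A small check on a stand-in row (the `header_row` Series below is built by hand for illustration, not scraped):

```python
import pandas as pd

# Stand-in for plant.iloc[0] (built by hand for illustration)
header_row = pd.Series(["Plant", "Meaning", "Region or culture"], name=0)

via_values = header_row.values.tolist()  # Series -> array -> list
direct = header_row.tolist()             # Series -> list

print(via_values == direct)
```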

In [21]:
# assign it to new_plant_column
new_plant_column = plant.iloc[0].values.tolist()

In [22]:
# assign the new names as the column headers
plant.columns = new_plant_column

In [23]:
plant

Unnamed: 0,Plant,Meaning,Region or culture
0,Plant,Meaning,Region or culture
1,Asparagus foliage,Fascination,Europe
2,Bamboo,"Longevity, strength, and grace",China
3,Green willow,False love,Britain
4,Mistletoe,Used to signify a meeting place where no viole...,Druids
5,Maple Tree/leaves,"balance, love, longevity and abundance",Various


In [24]:
# drop the row at index 0, which now just repeats the header
plant = plant.drop(0, axis=0)
plant

Unnamed: 0,Plant,Meaning,Region or culture
1,Asparagus foliage,Fascination,Europe
2,Bamboo,"Longevity, strength, and grace",China
3,Green willow,False love,Britain
4,Mistletoe,Used to signify a meeting place where no viole...,Druids
5,Maple Tree/leaves,"balance, love, longevity and abundance",Various
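
The steps above (read the header row, assign it to `columns`, drop the leftover row) recur for every table, so they can be wrapped in one function. This is a minimal sketch: `promote_header` is a hypothetical helper name, and the small `raw` table is a hand-built stand-in for `flower[1]`.

```python
import pandas as pd


def promote_header(table, header_row=0):
    """Use row `header_row` (by position) as the column names.

    The header row and every row before it are dropped, and the
    row index is reset so that the data starts again at 0.
    """
    cleaned = table.copy()  # leave the caller's DataFrame untouched
    cleaned.columns = cleaned.iloc[header_row].tolist()
    cleaned = cleaned.iloc[header_row + 1:]
    return cleaned.reset_index(drop=True)


raw = pd.DataFrame([
    ["Plant", "Meaning", "Region or culture"],
    ["Bamboo", "Longevity, strength, and grace", "China"],
    ["Green willow", "False love", "Britain"],
])

plant_demo = promote_header(raw)
print(plant_demo)
```

Unlike the cell above, the helper also resets the index. Passing `header_row=1` would cover a table whose column names sit in the second row.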


__Exporting our plant table to a CSV file, stored in a folder called plant-flower-meaning__

In [25]:
# plant.to_csv('plant-flower-meaning/plant and their meaning.csv')
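
`to_csv` does not create missing folders, so the target folder has to exist before the export runs. The sketch below creates it first. The `export_table` helper is hypothetical, the demo writes into a temporary directory instead of the notebook's working folder, and passing `index=False` (which leaves the row index out of the file) is a choice of this sketch; the commented-out call above would keep the index.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_table(table, folder, filename):
    """Write `table` to folder/filename as CSV, creating the folder if needed."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / filename
    table.to_csv(path, index=False)
    return path


demo = pd.DataFrame({"Plant": ["Bamboo"], "Region or culture": ["China"]})

with tempfile.TemporaryDirectory() as tmp:
    out = export_table(demo, Path(tmp) / "plant-flower-meaning",
                       "plant and their meaning.csv")
    restored = pd.read_csv(out)  # read the file back to check the export
    print(restored)
```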

In [26]:
# the table at index 2 holds the flowers
flowers = flower[2]
flowers

Unnamed: 0,0,1,2
0,Flower Meaning Acacia rose or white Secret Lov...,"Flower Meaning Ivy Dependence, endurance, fait...",
1,Flower,Flower,Meaning
2,Acacia,rose or white,Secret Love
3,Acacia,yellow,Elegance
4,Abatina [1],Abatina [1],Fickleness[2]
...,...,...,...
287,Wolfsbane,Wolfsbane,Misanthropy
288,Wormwood,Wormwood,"Absence, bitter sorrow"
289,Yarrow,Yarrow,"Healing, inspiration"
290,Ylang-Ylang,Ylang-Ylang,Never-ending love


### In the table above, the correct column names are in the second row (index 1)

In [27]:
# we can repeat the same procedure, this time with the row at index 1
flowers.columns = flowers.iloc[1].values.tolist()

#### We remove the first two rows because we do not need them

In [28]:
flowers = flowers.drop([0,1], axis=0)
flowers

Unnamed: 0,Flower,Flower.1,Meaning
2,Acacia,rose or white,Secret Love
3,Acacia,yellow,Elegance
4,Abatina [1],Abatina [1],Fickleness[2]
5,Acanthus,Acanthus,"Art, Immortality, Rebirth"
6,Agrimonia,Agrimonia,"Thankfulness, gratitude"
...,...,...,...
287,Wolfsbane,Wolfsbane,Misanthropy
288,Wormwood,Wormwood,"Absence, bitter sorrow"
289,Yarrow,Yarrow,"Healing, inspiration"
290,Ylang-Ylang,Ylang-Ylang,Never-ending love


#### We notice Wikipedia's pattern for reference links

These patterns are mostly numbers enclosed in square brackets, such as [2], [4], [1]

- our goal is to use __regular expressions to remove such patterns__

#### We will use the __applymap__ method in pandas with a custom function

In [29]:
# defining our function: remove any text enclosed in square brackets from string cells
def remove_brackets_values(cell):
    if isinstance(cell, str):
        return re.sub(r'\[.*?\]', '', cell)
    return cell

In [30]:
# using applymap to apply the custom function to each cell
# (pandas 2.1+ offers the same behaviour as DataFrame.map and deprecates applymap)
flowers = flowers.applymap(remove_brackets_values)
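
The pattern `\[.*?\]` removes any text in square brackets, and it leaves the preceding space behind in a cell such as `Abatina [1]`. A narrower variant, sketched below under the assumption that only numeric reference markers should go, matches `[digits]` plus any whitespace before it. The names `REFERENCE`, `strip_references`, and `clean_cells` are illustrative; `clean_cells` uses `DataFrame.map` when it exists (pandas 2.1 and later) and falls back to `applymap` otherwise.

```python
import re

import pandas as pd

# Numeric reference markers such as [1] or [12], plus whitespace before them
REFERENCE = re.compile(r"\s*\[\d+\]")


def strip_references(cell):
    """Remove reference markers from strings; return other values unchanged."""
    if isinstance(cell, str):
        return REFERENCE.sub("", cell).strip()
    return cell


def clean_cells(table):
    """Apply strip_references to every cell of `table`."""
    if hasattr(pd.DataFrame, "map"):         # pandas 2.1+
        return table.map(strip_references)
    return table.applymap(strip_references)  # older pandas


demo = pd.DataFrame({
    "Flower": ["Abatina [1]", "Acacia"],
    "Meaning": ["Fickleness[2]", None],
})
cleaned = clean_cells(demo)
print(cleaned)
```

Missing values pass through untouched because only `str` cells are rewritten.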

In [31]:
# resetting the index
flowers.reset_index(drop=True, inplace=True)

In [32]:
flowers

Unnamed: 0,Flower,Flower.1,Meaning
0,Acacia,rose or white,Secret Love
1,Acacia,yellow,Elegance
2,Abatina,Abatina,Fickleness
3,Acanthus,Acanthus,"Art, Immortality, Rebirth"
4,Agrimonia,Agrimonia,"Thankfulness, gratitude"
...,...,...,...
285,Wolfsbane,Wolfsbane,Misanthropy
286,Wormwood,Wormwood,"Absence, bitter sorrow"
287,Yarrow,Yarrow,"Healing, inspiration"
288,Ylang-Ylang,Ylang-Ylang,Never-ending love


In [33]:
flowers['Flower']

Unnamed: 0,Flower,Flower.1
0,Acacia,rose or white
1,Acacia,yellow
2,Abatina,Abatina
3,Acanthus,Acanthus
4,Agrimonia,Agrimonia
...,...,...
285,Wolfsbane,Wolfsbane
286,Wormwood,Wormwood
287,Yarrow,Yarrow
288,Ylang-Ylang,Ylang-Ylang


In [34]:
flowers.columns

Index(['Flower', 'Flower', 'Meaning'], dtype='object')

### We change the name of the second column to 'Attribute'


In [35]:
# two columns are named 'Flower', so we reassign the full list of names
# and change only the second one (position 1)
flowers.columns = ['Flower', 'Attribute', 'Meaning']

In [36]:
flowers.columns

Index(['Flower', 'Attribute', 'Meaning'], dtype='object')
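
Because two columns shared the name `Flower`, a call like `rename(columns={'Flower': ...})` would have changed both of them, and selecting `flowers['Flower']` returned a two-column DataFrame, as seen above. One option is to rename by position instead. The helper `rename_column_at` below is a hypothetical name for that idea; it builds a new list of names and assigns it back to `columns`.

```python
import pandas as pd


def rename_column_at(table, position, new_name):
    """Return a copy of `table` with the column at `position` renamed."""
    names = list(table.columns)
    names[position] = new_name
    renamed = table.copy()
    renamed.columns = names
    return renamed


# Hand-built stand-in with the same duplicated header as the flowers table
demo = pd.DataFrame(
    [["Acacia", "yellow", "Elegance"]],
    columns=["Flower", "Flower", "Meaning"],
)

fixed = rename_column_at(demo, 1, "Attribute")
print(list(fixed.columns))
```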

In [37]:
# confirming
flowers

Unnamed: 0,Flower,Attribute,Meaning
0,Acacia,rose or white,Secret Love
1,Acacia,yellow,Elegance
2,Abatina,Abatina,Fickleness
3,Acanthus,Acanthus,"Art, Immortality, Rebirth"
4,Agrimonia,Agrimonia,"Thankfulness, gratitude"
...,...,...,...
285,Wolfsbane,Wolfsbane,Misanthropy
286,Wormwood,Wormwood,"Absence, bitter sorrow"
287,Yarrow,Yarrow,"Healing, inspiration"
288,Ylang-Ylang,Ylang-Ylang,Never-ending love


#### At the end of this course

- we imported the pandas library
- we extracted the tables from the given url
- we looped through the tables to find the ones needed for our goal
- we extracted one table for plants and one for flowers
- we renamed the columns
- we dropped useless or wrong rows from our tables
- we cleaned our values further to remove unnecessary text like [9]
- we renamed the duplicated column header
- we exported our files

In [38]:
# flowers.to_csv('plant-flower-meaning/flowers and their meaning.csv')