# Web Scraping City Information from Wikipedia

Our task is to use Python to:
* download the entire HTML document of a city's Wikipedia page
* find and extract the numbers we need
* send the results to SQL

To do so, we:
* download the HTML file via a **get request**
* manipulate the HTML file via `BeautifulSoup`
* send the resulting data to SQL using `SQLAlchemy`
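
The second step, extracting values with `BeautifulSoup`, can be tried offline on a small inline snippet before touching the live site. The sketch below is only an illustration: `SAMPLE_HTML` is a hypothetical, heavily simplified stand-in for a Wikipedia infobox, and `infobox_value` is a helper name introduced here, not part of `BeautifulSoup`. Real infoboxes nest the population under several sub-rows, so the labels would need adjusting.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a Wikipedia infobox (assumption).
SAMPLE_HTML = """
<table class="infobox vcard">
  <tr><th class="infobox-label">Country</th>
      <td class="infobox-data"><a href="/wiki/Germany" title="Germany">Germany</a></td></tr>
  <tr><th class="infobox-label">Population</th>
      <td class="infobox-data">3,850,809</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() takes a CSS selector and returns a list of matching tags.
country = soup.select("table.vcard td.infobox-data a")[0]["title"]


def infobox_value(soup, label):
    """Return the text of the data cell whose row label equals `label`, or None."""
    for row in soup.select("table.vcard tr"):
        header = row.find("th")
        cell = row.find("td", class_="infobox-data")
        if header and cell and header.get_text(strip=True) == label:
            return cell.get_text(strip=True)
    return None


population = infobox_value(soup, "Population")
print(country, population)
```

Looking a value up by its row label tends to be less brittle than relying on a fixed position in the list of cells, since the number of rows differs from one city page to another.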

In [17]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import config  # local module holding the database credentials

### Definition of the scraping function

In [13]:
def get_city_info(city):

    # prepare the get request (url and headers)
    url = 'https://en.wikipedia.org/wiki/' + city 
    headers = {'Accept-Language':'en-US,en;q=0.8'} 
    # headers must be passed by keyword; the second positional argument is `params`
    response = requests.get(url, headers=headers)
    
    # create the soup
    soup = BeautifulSoup(response.content, 'html.parser')
    
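    # extract desired info with CSS selectors
    # the first link in the infobox data cells carries the country name in its title
    country = soup.select("table.vcard td.infobox-data a")[0]['title']
    # coordinates are extracted here but not stored in the dataframe below
    latitude = soup.select(".latitude")[0].getText()
    longitude = soup.select(".longitude")[0].getText()
    # the population is read by position; the index can differ between pages
    population = soup.select("td.infobox-data")[10].getText()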
    
    # create dataframe
    df = pd.DataFrame({
        "city_name":[city],
        "country_code":[country],
        "population":[population]
    })

    return df
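
City names that contain spaces or non-ASCII characters do not drop cleanly into a plain string concatenation. A minimal sketch of a safer URL builder follows, using only the standard library; `build_wiki_url` and `WIKI_BASE` are hypothetical names introduced for illustration, and the underscore convention is an assumption about how Wikipedia article titles are written in URLs.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"


def build_wiki_url(city):
    """Build an article URL: spaces become underscores, the rest is percent-encoded."""
    title = city.strip().replace(" ", "_")
    return WIKI_BASE + quote(title)


print(build_wiki_url("New York City"))
```

The result can then be passed to `requests.get(build_wiki_url(city), headers=headers)` in place of the concatenated `url`.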

### Data scraping

In [14]:
cities_list = ["Berlin", "Stockholm"]

all_cities_info = []

for city in cities_list:
    all_cities_info.append(get_city_info(city))

# concatenate dataframes
final_df = pd.concat(all_cities_info, ignore_index=True)

# remove the thousands separators
final_df["population"] = final_df["population"].str.replace(",", "")

# convert population from string to int
final_df["population"] = final_df["population"].astype(int)
final_df

| | city_name | country_code | population |
|---|---|---|---|
| 0 | Berlin | Germany | 3850809 |
| 1 | Stockholm | Sweden | 1000000 |
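
Infobox cells sometimes carry more than a bare number, for example a footnote marker or a year in parentheses, in which case a plain conversion to `int` would fail. The following is a sketch of a more tolerant parser, assuming the population is the first number in the cell; `parse_population` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import re


def parse_population(text):
    """Extract the first integer from a cell such as '3,850,809[1]', or None."""
    # Drop footnote markers like [1] or [a] before looking for digits.
    cleaned = re.sub(r"\[[^\]]*\]", "", text)
    match = re.search(r"\d[\d,]*", cleaned)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


print(parse_population("3,850,809[1]"))
```

It could be applied to the scraped column with `final_df["population"].map(parse_population)` before the dataframe is written to SQL.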


### Sending results to SQL

In [2]:
!pip install pymysql  # install pymysql if it is not already available



In [19]:
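# build the connection string for the MySQL database
import sqlalchemy  # pandas uses SQLAlchemy to open the connection
import pymysql  # MySQL driver named in the connection string

# the credentials are read from the local config module
host = config.host
schema = config.schema
user = config.user
password = config.password
port = config.port
con = f'mysql+pymysql://{user}:{password}@{host}:{port}/{schema}'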
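
If the password contains characters such as `@`, `:` or `/`, the connection string above is parsed incorrectly, because those characters have a meaning inside a URL. One way around this is to URL-encode the credentials first. The sketch below uses only the standard library; `build_mysql_url` is a hypothetical helper name, and the sample values are made up.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, schema):
    """Assemble a mysql+pymysql connection URL, escaping user and password."""
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{schema}"
    )


# Made-up credentials, for illustration only.
print(build_mysql_url("root", "p@ss:w/rd", "127.0.0.1", 3306, "gans"))
```

With the values from `config`, the call would be `con = build_mysql_url(user, password, host, port, schema)`.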

In [16]:
# fill the empty 'cities' table in SQL (rows are appended to the existing table)
final_df.to_sql('cities', con=con, if_exists='append', index=False)

2