
## Stage 2 - API and Web Data Scraping

For this stage, you will practice what you have learned in the APIs and Web Scraping lectures. The goal is to enrich your dataset with at least one additional feature.

Deliverables:
- Produce at least one Jupyter Notebook that shows the steps you took and the code you used to acquire and process the new data that you will merge with the Stage 1 output dataset. Keep in mind that the new data must be relevant to your reporting and conclusions.

Suggested tools: pandas, pathlib, dotenv, requests, bs4, selenium
​
## Stage 3 - Data Pipeline

For this stage, you will practice what you have learned in the Intermediate Python and Data Engineering lectures. The goal is to complete your project by delivering a full product that potential users can clone from your GitHub repository and use to generate reports, as described in the README file provided.

Deliverables:
- Produce a comprehensive project structure that includes all the files needed to run your reporting app, including at least one module.
- Bonus: You may also include complementary functions or modules that improve your app (e.g., command-line options, GUI, e-mailing, PDF). However, bear in mind that the core target of the project is to draw conclusions from your data, so the deliverables must focus on the data pipeline implementation.

Suggested tools: pandas, pathlib, dotenv, os, argparse, email, fpdf, smtplib, reportlab, matplotlib, seaborn, plotly, tkinter, subprocess, crontab

In [5]:

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import os 


In [6]:
forbes_info = pd.read_csv('/Users/Abacuc/lab/Ironhack-Module-1-Project-Abacuc-Mendez/data/processed/Billionaires_clean.csv')
forbes_info

Unnamed: 0,id,realTimePosition,name,lastName,gender,age,country,business category,company name,worth,worthChange,image
0,7203,1,Jeff bezos,Bezos,M,54.0,,Technology,Amazon,112.0 BUSD,0.0 millions USD,https://specials-images.forbesimg.com/imageser...
1,1824,2,Bill gates,Gates,M,62.0,,Technology,Microsoft,90.0 BUSD,-0.001 millions USD,https://specials-images.forbesimg.com/imageser...
2,7738,3,Warren buffett,Buffett,M,87.0,United States,Finance and Investments,Berkshire Hathaway,84.0 BUSD,-0.002 millions USD,https://specials-images.forbesimg.com/imageser...
3,9504,4,Bernard arnault,Arnault,,69.0,France,Fashion & Retail,LVMH,72.0 BUSD,0.0 millions USD,https://specials-images.forbesimg.com/imageser...
4,9302,5,Mark zuckerberg,Zuckerberg,M,35.0,United States,Technology,Facebook,71.0 BUSD,0.0 millions USD,https://specials-images.forbesimg.com/imageser...
...,...,...,...,...,...,...,...,...,...,...,...,...
2203,7059,2134,Zhao xiaoqiang,Zhao,,51.0,,Fashion & Retail,"fashion, entertainment",1.0 BUSD,0.0 millions USD,https://specials-images.forbesimg.com/imageser...
2204,2817,2134,Zhou liangzhang,Zhou,M,55.0,,Manufacturing,electrical equipment,1.0 BUSD,nan millions USD,https://specials-images.forbesimg.com/imageser...
2205,6862,1856,Zhu xingming,Zhu,M,51.0,China,Manufacturing,electrical equipment,1.0 BUSD,0.0 millions USD,https://specials-images.forbesimg.com/imageser...
2206,8437,1978,Zhuo jun,Zhuo,F,52.0,,Manufacturing,printed circuit boards,1.0 BUSD,0.0 millions USD,https://specials-images.forbesimg.com/imageser...


In [7]:
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")

In [8]:
table = soup.find_all('table',{'class':'wikitable sortable mw-datatable'})[0]
table


<table class="wikitable sortable mw-datatable" style="margin:auto;text-align:right">
<tbody><tr>
<th data-sort-type="number">Rank</th>
<th>Country<br/><small>(or dependent territory)</small></th>
<th>Population</th>
<th>% of World
<p>Population
</p>
</th>
<th>Date</th>
<th class="unsortable">Source
</th></tr>
<tr>
<td>1</td>
<td align="left"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/35px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/45px-Flag_of_the_People%27s_Republic_of_China.svg.png 2x" width="23"/></span> <a href="/wiki/Demographics_of

In [9]:

# Each <tr> element is one table row; row.text flattens its cells into a single string
rows = table.find_all('tr')
rows_parsed = [row.text for row in rows]
rows_parsed

['\nRank\nCountry(or dependent territory)\nPopulation\n% of World\nPopulation\n\n\nDate\nSource\n',
 '\n1\n\xa0China[b]\n1,400,781,440\n18.1%\n7 Jan 2020\nNational population clock[3]\n',
 '\n2\n\xa0India\n1,357,041,500\n17.5%\n7 Jan 2020\nNational population clock[4]\n',
 '\n3\n\xa0United States[c]\n330,546,475\n4.26%\n7 Jan 2020\nNational population clock[5]\n',
 '\n4\n\xa0Indonesia\n266,911,900\n3.44%\n1 Jul 2019\nNational annual projection[6]\n',
 '\n5\n\xa0Pakistan\n218,198,000\n2.81%\n7 Jan 2020\nNational population clock[7]\n',
 '\n6\n\xa0Brazil\n210,951,255\n2.72%\n7 Jan 2020\nNational population clock[8]\n',
 '\n7\n\xa0Nigeria\n200,963,599\n2.59%\n1 Jul 2019\nUN Projection[2]\n',
 '\n8\n\xa0Bangladesh\n167,888,084\n2.16%\n7 Jan 2020\nNational population clock[9]\n',
 '\n9\n\xa0Russia[d]\n146,780,720\n1.89%\n1 Jan 2019\nNational estimate[10]\n',
 '\n10\n\xa0Mexico\n127,191,826\n1.64%\n1 Jan 2020\nNational annual projection[11]\n',
 '\n11\n\xa0Japan\n126,150,000\n1.63%\n1 Dec 20
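
Flattening each row with `row.text` loses the cell boundaries, which is why the parser in the next cell has to repair the header by hand. As an alternative sketch (not the approach used in this notebook), each `<th>` or `<td>` cell can be read separately. The HTML snippet and the `row_cells` helper are made up for illustration, and the built-in `html.parser` backend is assumed so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only
sample_html = """
<table>
<tr><th>Rank</th><th>Country<br/><small>(or dependent territory)</small></th><th>Population</th></tr>
<tr><td>1</td><td>&#160;China<sup>[b]</sup></td><td>1,400,781,440</td></tr>
</table>
"""

def row_cells(row):
    """Return the stripped text of every header or data cell in one <tr>."""
    return [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]

sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')
sample_rows = [row_cells(row) for row in sample_table.find_all('tr')]
print(sample_rows)
```

Reading cell by cell keeps one list entry per column, so a multi-line header no longer needs special-case string replacements.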

In [10]:

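def smart_parser(row_text):
    """Split the flattened text of one table row into a list of cell strings."""
    # Collapse the blank lines left by the multi-line "% of World Population" header
    row_text = row_text.replace('\nPopulation\n\n\n', '\nPopulation\n').strip('\n')
    row_text = row_text.replace('\n\n', '\n').strip('\n')
    # Merge the split header fragments so the header row has six cells, like the data rows
    row_text = row_text.replace('\nPopulation\n% of World\n', '\nPopulation % of World\n').strip('\n')
    # Remove footnote markers; this pattern only matches single-digit ones such as [3],
    # so markers like [10] or [b] are left in place
    row_text = re.sub(r'\[\d\]', '', row_text)
    return [cell.strip() for cell in row_text.split('\n')]

well_parsed = [smart_parser(row) for row in rows_parsed]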

well_parsed

[['Rank',
  'Country(or dependent territory)',
  'Population % of World',
  'Population',
  'Date',
  'Source'],
 ['1',
  'China[b]',
  '1,400,781,440',
  '18.1%',
  '7 Jan 2020',
  'National population clock'],
 ['2',
  'India',
  '1,357,041,500',
  '17.5%',
  '7 Jan 2020',
  'National population clock'],
 ['3',
  'United States[c]',
  '330,546,475',
  '4.26%',
  '7 Jan 2020',
  'National population clock'],
 ['4',
  'Indonesia',
  '266,911,900',
  '3.44%',
  '1 Jul 2019',
  'National annual projection'],
 ['5',
  'Pakistan',
  '218,198,000',
  '2.81%',
  '7 Jan 2020',
  'National population clock'],
 ['6',
  'Brazil',
  '210,951,255',
  '2.72%',
  '7 Jan 2020',
  'National population clock'],
 ['7', 'Nigeria', '200,963,599', '2.59%', '1 Jul 2019', 'UN Projection'],
 ['8',
  'Bangladesh',
  '167,888,084',
  '2.16%',
  '7 Jan 2020',
  'National population clock'],
 ['9',
  'Russia[d]',
  '146,780,720',
  '1.89%',
  '1 Jan 2019',
  'National estimate[10]'],
 ['10',
  'Mexico',
  '127,19
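
The output above still contains markers such as `[b]` and `[10]`, because the pattern `\[\d\]` matches exactly one digit between the brackets. A minimal sketch of a broader pattern follows, assuming every footnote marker is a bracketed run of digits or a single lowercase letter; the `strip_footnotes` name is hypothetical.

```python
import re

# Assumption: footnote markers look like [3], [10], [197] or [b]
FOOTNOTE = re.compile(r'\[(?:\d+|[a-z])\]')

def strip_footnotes(text):
    """Remove bracketed footnote markers, then trim surrounding whitespace."""
    return FOOTNOTE.sub('', text).strip()

cleaned = [strip_footnotes(s) for s in ['China[b]', 'National estimate[10]', 'UN Projection[2]']]
print(cleaned)
```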

In [11]:

colnames = well_parsed[0]
data = well_parsed[1:]

df = pd.DataFrame(data, columns=colnames)

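# Strip footnote markers such as [b] and parenthetical notes from the country names
df['Country(or dependent territory)'] = df['Country(or dependent territory)'].replace([r'\[(.*?)\]', r'\(([^\)]+)\)'], ['', ''], regex=True)

# Rename the columns in table order; 'country' matches the key used in the Forbes dataset
df.columns = ['Rank', 'country', 'Population', '% of World Population', 'Date', 'Source']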

df

Unnamed: 0,Rank,country,Population,% of World Population,Date,Source
0,1,China,1400781440,18.1%,7 Jan 2020,National population clock
1,2,India,1357041500,17.5%,7 Jan 2020,National population clock
2,3,United States,330546475,4.26%,7 Jan 2020,National population clock
3,4,Indonesia,266911900,3.44%,1 Jul 2019,National annual projection
4,5,Pakistan,218198000,2.81%,7 Jan 2020,National population clock
...,...,...,...,...,...,...
236,–,Tokelau,1400,0.0000180%,1 Jul 2018,National annual estimate[94]
237,–,Vatican City,799,0.0000103%,1 Jul 2019,UN projection
238,–,Cocos Islands,538,0.00000693%,30 Jun 2018,National estimate[197]
239,–,Pitcairn Islands,50,0.000000644%,1 Jan 2019,National estimate[198]
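
Every scraped value is a string, so the new columns cannot be used in calculations as they stand. A short sketch of one way to convert them, assuming `Population` still carries thousands separators (as in the raw rows shown earlier) and the percentage column ends in `%`; the `sample` frame is a stand-in that reuses two rows from the table:

```python
import pandas as pd

# Stand-in for the scraped table, using two rows from the output above
sample = pd.DataFrame({
    'country': ['China', 'United States'],
    'Population': ['1,400,781,440', '330,546,475'],
    '% of World Population': ['18.1%', '4.26%'],
})

# Drop the thousands separators, then parse as integers
sample['Population'] = pd.to_numeric(sample['Population'].str.replace(',', '', regex=False))
# Drop the trailing percent sign, then parse as floats
sample['% of World Population'] = pd.to_numeric(sample['% of World Population'].str.rstrip('%'))
print(sample.dtypes)
```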


In [12]:
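# Inner join (the pandas default): billionaires whose country is missing or unmatched are dropped
forbes_info_result = pd.merge(forbes_info, df, on='country', how='inner')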
forbes_info_result


Unnamed: 0,id,realTimePosition,name,lastName,gender,age,country,business category,company name,worth,worthChange,image,Rank,Population,% of World Population,Date,Source
0,7738,3,Warren buffett,Buffett,M,87.0,United States,Finance and Investments,Berkshire Hathaway,84.0 BUSD,-0.002 millions USD,https://specials-images.forbesimg.com/imageser...,3,330546475,4.26%,7 Jan 2020,National population clock
1,9302,5,Mark zuckerberg,Zuckerberg,M,35.0,United States,Technology,Facebook,71.0 BUSD,0.0 millions USD,https://specials-images.forbesimg.com/imageser...,3,330546475,4.26%,7 Jan 2020,National population clock
2,3444,8,Larry ellison,Ellison,,73.0,United States,Technology,software,58.5 BUSD,-0.001 millions USD,https://specials-images.forbesimg.com/imageser...,3,330546475,4.26%,7 Jan 2020,National population clock
3,6325,11,Michael bloomberg,Bloomberg,M,76.0,United States,Media & Entertainment,Bloomberg LP,50.0 BUSD,0.0 millions USD,https://specials-images.forbesimg.com/imageser...,3,330546475,4.26%,7 Jan 2020,National population clock
4,4765,9,Larry page,Page,M,45.0,United States,Technology,Google,48.8 BUSD,0.0 millions USD,https://specials-images.forbesimg.com/imageser...,3,330546475,4.26%,7 Jan 2020,National population clock
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
647,6824,1543,Binod chaudhary,Chaudhary,M,63.0,Nepal,Diversified,diversified,1.5 BUSD,0.0 millions USD,https://specials-images.forbesimg.com/imageser...,48,29609623,0.382%,1 Jul 2019,National annual projection[47]
648,9327,1633,Ilona herlin,Herlin,F,53.0,Finland,Manufacturing,"elevators, escalators",1.5 BUSD,0.0 millions USD,https://specials-images.forbesimg.com/imageser...,114,5527405,0.0712%,30 Nov 2019,National monthly estimate[107]
649,7984,1677,Kutayba alghanim,Alghanim,M,72.0,Kuwait,Diversified,diversified,1.4 BUSD,0.0 millions USD,https://specials-images.forbesimg.com/imageser...,124,4420110,0.0570%,1 Jan 2019,National annual estimate[117]
650,3532,1865,Tran dinh long,Tran dinh,,57.0,Vietnam,Manufacturing,"steel, heavy industry",1.3 BUSD,0.0 millions USD,https://specials-images.forbesimg.com/imageser...,15,96208984,1.24%,1 Apr 2019,Preliminary 2019 census result[16]
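
The merged table ends at index 650 while the Forbes table ends at index 2206, because an inner join keeps only rows whose `country` appears in both tables. The sketch below shows how a left join with `indicator=True` could keep every billionaire and flag the unmatched ones. The two small frames are stand-ins built from rows shown above, and `people`, `population` and `unmatched` are illustrative names only.

```python
import pandas as pd

# Stand-ins for forbes_info and the scraped population table
people = pd.DataFrame({
    'name': ['Jeff bezos', 'Warren buffett', 'Zhu xingming'],
    'country': [None, 'United States', 'China'],
})
population = pd.DataFrame({
    'country': ['China', 'United States'],
    'Population': [1400781440, 330546475],
})

# how='left' keeps every row of `people`; indicator=True adds a '_merge' column
merged = pd.merge(people, population, on='country', how='left', indicator=True)

# Rows marked 'left_only' found no matching country in `population`
unmatched = merged.loc[merged['_merge'] == 'left_only', 'name'].tolist()
print(unmatched)
```

Listing the unmatched names makes it easier to tell missing countries apart from spelling differences between the two sources.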


In [19]:
output_folder = 'processed'
# Create the output folder if it does not exist yet
os.makedirs(output_folder, exist_ok=True)

forbes_info_result.to_csv(os.path.join(output_folder, 'Results.csv'), index=False)