# ADVANCED PANDAS: DATA IMPORTING & WEB SCRAPING

Course Outline:
- Basic Data Importing
    - Flat Files (.csv, .tsv, .txt)
    - Excel Files (.xlsx)
    - Other Files (.dta, .mat, etc.)
- Importing Data from Relational Databases
    - SQL Crash Course
    - Database Files (.db, .sqlite, etc.)
- ***Importing Data from the Internet***
    - ***HTML & CSS Crash Course***
    - ***Working with JSON Data & APIs***
    - Web Scraping Basics
    - Case-study: Wuzzuf.com [Web Scraping]

### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

==========

## Importing Data from the Internet (Web Scraping)

- HTML & CSS Crash Course
- Working with JSON Data & APIs
- Web Scraping Basics
    - Case-study: Wuzzuf.com [Web Scraping]

In [None]:
from IPython.display import Image
Image("data/ws.jpg")

==========

### HTML & CSS Basics

HTML, CSS, & JS Online Editors
- https://codepen.io/pen/
- https://liveweave.com/

HTML Tutorial:
- https://www.w3schools.com/html/
- https://www.sololearn.com/learning/1014

CSS Tutorial:
- https://www.w3schools.com/css/
- https://www.sololearn.com/learning/1023

##### Your Very First Web Page (Demo)

##### Reading HTML Web Page in Pandas

In [None]:
# specify the web page address
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)'

In [None]:
# read every table on the page into a list of DataFrames with 'pd.read_html()';
# header=1 takes the second row of each table as the column names
countries = pd.read_html(url, header=1)

In [None]:
type(countries)

In [None]:
len(countries)

In [None]:
df_countries = countries[0]
df_countries
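`pd.read_html()` returns a list because a page can hold several `<table>` elements, each made of `<tr>` rows with `<th>`/`<td>` cells. The sketch below shows roughly what extracting such a table involves, using only the standard library on a tiny hand-written page. The page content and the `TableRows` class are made up for illustration; pandas relies on external HTML parsers internally, not on this class.

```python
from html.parser import HTMLParser

# A tiny hand-written page (illustrative data, not taken from Wikipedia)
PAGE = """
<html><body>
  <h1>Demo</h1>
  <table>
    <tr><th>country</th><th>region</th></tr>
    <tr><td>Egypt</td><td>Africa</td></tr>
    <tr><td>Japan</td><td>Asia</td></tr>
  </table>
</body></html>
"""


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being read
        self._in_cell = False   # True while inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text outside table cells (e.g. the <h1>) is ignored
        if self._in_cell:
            self._row[-1] += data.strip()


parser = TableRows()
parser.feed(PAGE)
header, *records = parser.rows
print(header)
print(records)
```

The first row becomes the header and the remaining rows become records, which is the same split that the `header` argument of `pd.read_html()` controls.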

In [None]:
# renaming columns
df_countries.rename({"Country (or territory)":"country", "Region":"region","Estimate.1":"gdp"}, axis=1, inplace=True)

In [None]:
df_countries

In [None]:
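# keep only the columns we need; .copy() gives an independent DataFrame,
# so later column assignments do not trigger a SettingWithCopyWarning
df_countries = df_countries.loc[:, ['country', 'region', 'gdp']].copy()
df_countries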

In [None]:
# remove the "†" footnote mark from the 'country' column (plain-text match, not a regex)
df_countries['country'] = df_countries['country'].str.replace("†", "", regex=False)

In [None]:
# count the missing values in the 'gdp' column
df_countries['gdp'].isna().sum()

In [None]:
# Let's get the maximum gdp from each region
# (selecting 'gdp' first avoids also taking the alphabetical "max" of the country names)
df_countries.groupby("region")["gdp"].max()
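A per-region maximum tells you the value but not which country it belongs to. As a minimal sketch, assuming a small DataFrame with the same three column names, `idxmax()` can be combined with `.loc` to recover the whole row. The countries, regions, and numbers below are invented placeholders, not real GDP figures.

```python
import pandas as pd

# Made-up values purely for illustration (not real GDP figures)
df = pd.DataFrame({
    "country": ["A†", "B", "C", "D†"],
    "region": ["North", "North", "South", "South"],
    "gdp": [10, 30, 25, 5],
})

# Strip the footnote marker as plain text, not as a regular expression
df["country"] = df["country"].str.replace("†", "", regex=False)

# Row label of the largest gdp inside each region, then the full rows
top = df.loc[df.groupby("region")["gdp"].idxmax()]
print(top)
```

This keeps the `country` and `gdp` values aligned, which a plain `.max()` over all columns does not guarantee.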

In [None]:
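# pd.read_html() does not work on every site: this call may fail, for example when
# the server rejects the request or the page contains no plain HTML <table>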
pd.read_html("https://www.worldometers.info/gdp/gdp-by-country/")
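One possible cause of such a failure is a server that rejects requests lacking a browser-like `User-Agent` header. This is an assumption about the site, not something the notebook confirms. A minimal sketch of a workaround, using only the standard library, is to download the HTML yourself with an explicit header. The `USER_AGENT` value and the helper names `build_request` and `fetch_html` are illustrative.

```python
from urllib.request import Request, urlopen

# Assumption: any browser-like string; this value is illustrative only
USER_AGENT = "Mozilla/5.0 (compatible; course-demo)"


def build_request(url):
    """Create a GET request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (raises on HTTP errors)."""
    with urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

The returned text can then be handed to pandas with `pd.read_html(StringIO(html))` (after `from io import StringIO`). Whether that succeeds for a given site is not guaranteed: tables built by JavaScript are not present in the downloaded HTML, and a site's terms of use may restrict automated access.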

==========

### Working with JSON Data & APIs

##### JavaScript Object Notation (JSON)

JSON Tutorial: https://www.w3schools.com/js/js_json_intro.asp

In [None]:
from IPython.display import Image
Image("data/json.jpg")

In [None]:
json = '''{
  "name": ["Ahmed Radwan", "Mustafa Othman", "Siddiq Burhan", "Omnia Nasser"],
  "salary": [64000, 73200, 76400, 94300],
  "occupation": [
    "Software Technician",
    "Data Scientist",
    "Business Consultant",
    "Aerospace Engineer"
  ]
}'''

In [None]:
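# wrap the text in StringIO: recent pandas versions deprecate passing a literal JSON string
from io import StringIO
df_json = pd.read_json(StringIO(json))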
df_json

In [None]:
df_json = pd.read_json('data/folks.json')
df_json
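The JSON string above is column-oriented: each key maps to a list of values. Many APIs return the record-oriented form instead, with one object per row. The sketch below converts between the two using only the standard library; the shortened `column_json` string is an excerpt of the data above. The module is imported under the alias `json_lib` because this notebook already uses the name `json` for a string variable.

```python
import json as json_lib  # alias: the notebook already uses the name `json` for a string

# Column-oriented JSON, same shape as the string above (shortened)
column_json = '{"name": ["Ahmed Radwan", "Omnia Nasser"], "salary": [64000, 94300]}'

columns = json_lib.loads(column_json)  # str -> dict of lists

# Convert to record-oriented form: one dict per row
records = [dict(zip(columns, values)) for values in zip(*columns.values())]

print(json_lib.dumps(records, indent=2))
```

Both shapes can be loaded into a DataFrame; `pd.read_json()` has an `orient` parameter for telling it which layout to expect when it cannot infer it.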

##### Application Programming Interfaces (APIs)

In [None]:
from IPython.display import Image
Image("data/apis.png")

- Facebook / Twitter APIs
- The Movie Database (TMDB): https://www.themoviedb.org/
- The Movie Database (TMDB) API: https://developers.themoviedb.org/3/getting-started/introduction

In [None]:
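import requests

# TMDB movie-details endpoint; the API key is sent as a query parameter
api_url = "https://api.themoviedb.org/3/movie/399566"
params = {"api_key": "51529d53b81238483643314ff5613da9"}

# Call the API and fail early on an HTTP error (e.g. an invalid key)
response = requests.get(api_url, params=params)
response.raise_for_status()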

In [None]:
# Parse the JSON body of the response into a Python dict
data = response.json()
data

In [None]:
# 'genres' is a list of dicts, so it converts directly into a DataFrame
genres = pd.DataFrame(data['genres'])
genres

In [None]:
genres['name']
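The request above carries its API key inside the notebook, which exposes the key to anyone who receives the file. A common alternative, sketched here under stated assumptions, is to read the key from an environment variable and build the URL from it. The variable name `TMDB_API_KEY` and the helper `movie_url` are hypothetical choices for this example, not part of the TMDB API.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://api.themoviedb.org/3"


def movie_url(movie_id, api_key):
    """Build the movie-details URL with the key as a query parameter."""
    return f"{BASE_URL}/movie/{movie_id}?{urlencode({'api_key': api_key})}"


# TMDB_API_KEY is a hypothetical variable name; set it in your shell first
api_key = os.environ.get("TMDB_API_KEY", "")

# Only build the URL when a key is available
api_url = movie_url(399566, api_key) if api_key else None
if api_url is None:
    print("Set TMDB_API_KEY before calling the API")
```

With the key set, `api_url` can be passed to `requests.get()` exactly as before, and the notebook can be shared without the secret in it.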

==========

# THANK YOU!