# ADVANCED PANDAS: DATA IMPORTING & WEB SCRAPING

**Course Outline:**
- ***Basic Data Importing***
    - *Flat Files (.csv, .tsv, .txt)*
    - *Excel Files (.xlsx)*
    - *Other Files (.sas7bdat, .dta, .mat, etc.)*
    - *Basic Data Importing Exercises*
- Importing Data from Databases
    - SQL Crash Course
    - Database Files (.db, .sqlite, etc.)
- Importing Data from the Internet
    - HTML & CSS Crash Course
    - Web Scraping Basics
    - Working with JSON Data & APIs
    - Case-study: Wuzzuf.com [Web Scraping]

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

==========

## Basic Data Importing in Python

- Importing Data from Flat Files
- Importing Data from Excel Files
- Importing Data from Other Files (SAS, Stata, MATLAB)

The Pandas I/O APIs: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

### Importing Data from Flat Files (.csv, .tsv, .txt)

In [None]:
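df_flat = pd.read_csv('data/titanic_sub.csv',
                      skiprows=0,                 # rows to skip at the top of the file
                      skipfooter=0,               # rows to skip at the bottom (non-zero values need engine='python')
                      nrows=None,                 # no limit on the number of rows read
                      header=0,                   # first row holds the column names
                      index_col=None,             # keep the default integer index
                      usecols=None,               # read all columns
                      dtype={'Age': np.float64},  # force 'Age' to float
                      # error_bad_lines / warn_bad_lines were deprecated in pandas 1.3
                      # and removed in 2.0; on_bad_lines replaces both
                      on_bad_lines='warn',        # warn about malformed rows and skip them
                      )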
df_flat.head()
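
To see what `on_bad_lines` does without depending on the course data, here is a minimal sketch. It parses a small, made-up CSV string from memory; the column names and values are illustrative assumptions, not taken from `titanic_sub.csv`. It uses `'skip'` instead of `'warn'` so that no warning is printed.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV: the third data row has one field too many.
raw = "Name,Age\nAnna,29\nBen,41\nCara,35,extra\nDan,52\n"

# on_bad_lines='skip' silently drops rows with too many fields (pandas >= 1.3).
df_demo = pd.read_csv(io.StringIO(raw),
                      dtype={'Age': 'float64'},
                      on_bad_lines='skip')
print(df_demo)
```

The malformed `Cara` row is dropped and the other three rows are kept. With the default `on_bad_lines='error'`, the same input would raise a parser error instead.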

In [None]:
# .tsv file => tab-separated values
# read_table already defaults to sep='\t'; it is passed explicitly here for clarity
df_flat = pd.read_table('data/titanic.tsv', sep='\t')
df_flat.head()

In [None]:
# File has two columns separated by a tab
df_flat = pd.read_csv('data/seaslug.txt', sep='\t')
df_flat.head()

In [None]:
df_flat.isna().sum()
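
The tab-separated read and the missing-value count can be tried on a self-contained example. This is a small sketch using invented in-memory data; the `Time` and `Percent` columns are assumptions for illustration and may not match `seaslug.txt`.

```python
import io
import pandas as pd

# Hypothetical tab-separated text; the second data row has an empty 'Percent' field.
raw_tsv = "Time\tPercent\n99\t0.067\n99\t\n9\t0.333\n"

df_demo = pd.read_csv(io.StringIO(raw_tsv), sep='\t')

# isna() marks missing cells; sum() counts them per column.
missing = df_demo.isna().sum()
print(missing)
```

Empty fields are parsed as `NaN`, so `missing` reports one missing value in `Percent` and none in `Time`.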

### Importing Data from Excel Files (.xlsx)

In [None]:
df_excel = pd.read_excel('data/fcc-new-coder-survey.xlsx',
                         skiprows=2,
                         header=0,
                         usecols='W:AB,AR,BG',   # columns W through AB, plus AR and BG
                         sheet_name='2016',
                         parse_dates=['Part1EndTime'],
                         index_col=7)
df_excel.head()
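
The `usecols` string uses Excel column letters, and its ranges include both endpoints. To make the selection easier to reason about, the sketch below expands such a string into zero-based column positions. The helpers `excel_col_to_index` and `expand_usecols` are hypothetical names written for this illustration. They are not part of pandas.

```python
def excel_col_to_index(letters):
    """Convert an Excel column label such as 'AB' to a zero-based position."""
    index = 0
    for ch in letters.upper():
        # Column labels behave like a base-26 number with digits A=1 ... Z=26.
        index = index * 26 + (ord(ch) - ord('A') + 1)
    return index - 1


def expand_usecols(spec):
    """Expand a spec like 'W:AB,AR,BG' into a list of zero-based positions."""
    positions = []
    for part in spec.split(','):
        start, _, end = part.strip().partition(':')
        first = excel_col_to_index(start)
        last = excel_col_to_index(end) if end else first
        positions.extend(range(first, last + 1))  # ranges include both ends
    return positions


selected = expand_usecols('W:AB,AR,BG')
print(selected)       # [22, 23, 24, 25, 26, 27, 43, 58]
print(len(selected))  # 8
```

The spec selects eight columns in total: six from `W` through `AB`, plus `AR` and `BG`.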

### Importing Data from Other Files

##### Importing SAS Files

In [None]:
df_sas = pd.read_sas('data/sales.sas7bdat')
df_sas.head()

In [None]:
# Alternative reader: the third-party sas7bdat package
# pip install sas7bdat

In [None]:
from sas7bdat import SAS7BDAT
with SAS7BDAT('data/sales.sas7bdat') as file:
    df_sas = file.to_data_frame()
df_sas.head()

##### Importing Stata Files (.dta)

In [None]:
df_stata = pd.read_stata('data/disarea.dta')
df_stata

##### Importing MATLAB Files (.mat)

In [None]:
import scipy.io
# loadmat returns a dict mapping variable names to NumPy arrays, not a DataFrame
df_mat = scipy.io.loadmat('data/ja_data2.mat')
df_mat
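
Because `loadmat` returns a dictionary, a further step is needed to get a DataFrame. The sketch below shows one possible approach on a made-up `.mat` payload written to memory. The variable names `x` and `y` are assumptions for illustration and are unrelated to `ja_data2.mat`. It only works when the stored arrays are vectors of equal length, so real files may need different handling.

```python
import io
import numpy as np
import pandas as pd
import scipy.io

# Build a small hypothetical .mat payload in memory so the example is self-contained.
buf = io.BytesIO()
scipy.io.savemat(buf, {'x': np.arange(3.0), 'y': np.array([10.0, 20.0, 30.0])})
buf.seek(0)

mat = scipy.io.loadmat(buf)

# Drop the metadata keys ('__header__', '__version__', '__globals__') and
# flatten each array, since loadmat returns vectors as 2-D arrays.
arrays = {name: values.ravel()
          for name, values in mat.items()
          if not name.startswith('__')}

df_demo = pd.DataFrame(arrays)
print(df_demo)
```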

==========

# THANK YOU!