## Importing from more data sources
Data doesn't always arrive as tidy CSV or Excel files; it comes from many different sources.
Sometimes we might even have to scrape it from websites using web crawlers.


## Introduction

In this lab, you'll get some practice with loading files with summary or metadata, and if you find that easy, the optional "level up" content covers loading data from a corrupted csv file!

## Objectives
You will be able to:
- Import data from CSV and Excel files
- Import data from URLs, HTML files, databases, and more

## We will also look at loading files with summary or metadata

Load either of the files `'Zipcode_Demos.csv'` or `'Zipcode_Demos.xlsx'`. What's going on with this dataset? Clean it up into a useable format and describe the nuances of how the data is currently formatted.

All data files are stored in a folder titled `'data'`
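
Files like this often wrap the real table in extra material: a few descriptive lines at the top and summary rows at the bottom. Here is a minimal sketch of one way to handle that, using a small made-up CSV string. The layout below is an assumption for illustration, not the actual layout of `Zipcode_Demos.csv`. `skiprows` skips the leading metadata and `nrows` stops before the trailing summary.

```python
import io
import pandas as pd

# Hypothetical file contents: two metadata lines, the real table, then a summary row.
raw = """Report: zipcode demographics (made-up sample)
Generated: 2019-01-01
zipcode,participants,female
10001,44,22
10002,35,19
10003,50,31
TOTAL,129,72
"""

# skiprows=2 drops the metadata lines; nrows=3 stops before the summary row.
demos = pd.read_csv(io.StringIO(raw), skiprows=2, nrows=3)
print(demos)
```

For a real file you would first open it in a text editor (or print the first few raw lines) to count how many lines to skip.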

#### In the previous session we saw how to import other files as below

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [None]:
# data = pd.read_excel("../data/Transactions_Zenapt.xlsx", sheet_name="Expense")
# data.head(3)

# <a id="7">Opening Excel Files In Jupyter Notebook</a>

In [None]:
pd.options.mode.chained_assignment = None

path='credit_card_defaults(UCI).xls'
pdexcel = pd.ExcelFile(path)
print(pdexcel.sheet_names)
df = pdexcel.parse("Data", skiprows=1)  # skiprows=1 drops the first header row because the sheet has a multi-level header

print (df.shape)
df.head()


# <a id="7">Importing from URL</a>

In [8]:
url = "http://www.basketball-reference.com/leagues/NBA_2015_totals.html"
data = pd.read_html(url)  # read from the URL
data[0].iloc[:, 0:20].head()  # first 5 rows, first 20 columns

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA
0,1,Quincy Acy,PF,24,NYK,68,22,1287,152,331,0.459,18,60,0.3,134,271,0.494,0.486,76,97
1,2,Jordan Adams,SG,20,MEM,30,0,248,35,86,0.407,10,25,0.4,25,61,0.41,0.465,14,23
2,3,Steven Adams,C,21,OKC,70,67,1771,217,399,0.544,0,2,0.0,217,397,0.547,0.544,103,205
3,4,Jeff Adrien,PF,28,MIN,17,0,215,19,44,0.432,0,0,,19,44,0.432,0.432,22,38
4,5,Arron Afflalo,SG,29,TOT,78,72,2502,375,884,0.424,118,333,0.354,257,551,0.466,0.491,167,198



# <a id="7">Importing the titanic data from a Github url</a>

In [12]:
titanic = pd.read_csv("https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv",
                         sep='\t')
titanic.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


### Let's look at the name and passenger class of each passenger

In [13]:
titanic[["Name","Pclass"]].head()

Unnamed: 0,Name,Pclass
0,"Braund, Mr. Owen Harris",3
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1
2,"Heikkinen, Miss. Laina",3
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1
4,"Allen, Mr. William Henry",3


In [14]:
import os
import sys

In [17]:
dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)
ts.to_csv('../data/newtseries.csv')
!cat ../data/newtseries.csv

,0
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
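
Reading the series back is the mirror image of writing it. The sketch below is an illustration rather than part of the original lesson: it writes to an in-memory buffer instead of `../data/newtseries.csv` so that it runs anywhere, then uses `index_col=0` and `parse_dates=True` to restore the dates as the index.

```python
import io
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)

# Write to an in-memory buffer (an assumption made so the sketch needs no data folder).
buffer = io.StringIO()
ts.to_csv(buffer)
buffer.seek(0)

# index_col=0 uses the first column as the index; parse_dates=True converts it to datetimes.
ts_back = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(ts_back.index.dtype)
```

Note that `read_csv` returns a DataFrame with a single column named `"0"`, not a Series.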


In [66]:
# a) Read the CSV file 'demo.csv' into a DataFrame `demo`, treating '.' as a missing value.

pd.options.mode.chained_assignment = None

path='../data/demo.csv'
demo=pd.read_csv(path, sep=",", na_values='.')
demo.head(4)

Unnamed: 0,sex,age,siblings,color,income
0,female,27.6,1,red,6611.0
1,female,49.3,2,green,7652.0
2,male,16.8,3,red,1778.0
3,male,14.4,0,red,567.0


# <a id="7">Getting data from txt files</a>

+ Read the file `survey.txt` from the `data` folder into a DataFrame `surveydf`.
+ Use the first line as the column names and treat a dot `.` in the file as a missing value.


In [68]:
surveydf=pd.read_table('../data/survey.txt', sep=' ', header=0, na_values='.')
surveydf.head()

Unnamed: 0,age,worktype,education,marrital,relationship,gender,capitalgain,workhours,income
0,25,Private,11th,Never-married,Own-child,Male,0.0,40.0,<=50K
1,38,Private,,Married-civ-spouse,Husband,",",0.0,50.0,<=50K
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Husband,Male,0.0,40.0,>50K
3,44,Private,Some-college,Married-civ-spouse,Husband,,7688.0,40.0,>50K
4,34,Private,10th,Never-married,Not-in-family,Male,0.0,30.0,<=50K
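
To see what `na_values='.'` does in isolation, here is a small sketch on made-up, space-separated text in the spirit of `survey.txt`. The sample rows are invented for illustration.

```python
import io
import pandas as pd

# Made-up, space-separated sample; '.' marks a missing value.
raw = """age worktype workhours
25 Private 40
38 . 50
28 Local-gov .
"""

survey_sample = pd.read_csv(io.StringIO(raw), sep=' ', header=0, na_values='.')
# Count the missing values that were recognised in each column.
print(survey_sample.isna().sum())
```

Without `na_values='.'`, the dots would be read as ordinary strings and `workhours` would not be numeric.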



# <a id="7">Getting the titanic data</a>

In [None]:
import os
TITANIC_PATH = os.path.join("datasets","titanic")

In [None]:
def load_titanic_data(filename, titanic_path=TITANIC_PATH):
    csv_path = os.path.join(titanic_path, filename)
    return pd.read_csv(csv_path)

In [None]:
train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")


# <a id="7">Getting the California housing Data from Github</a>

In [56]:
import tarfile
import urllib.request  # "import urllib" alone does not guarantee that urllib.request is loaded

In [57]:
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets","housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_the_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target folder if it does not already exist.
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")  # local path for the downloaded archive
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

In [58]:
fetch_the_data()

In [59]:
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [60]:
data = load_housing_data()
data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
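
The extraction step can be tried without a network connection. The sketch below is a stand-in and not the housing data itself: it packs a tiny made-up CSV into a `.tgz` inside a temporary folder and extracts it again, using `with` blocks so the archive is closed automatically.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Create a tiny CSV and pack it into an archive (stand-in for housing.tgz).
    csv_path = os.path.join(workdir, "sample.csv")
    with open(csv_path, "w") as f:
        f.write("longitude,latitude\n-122.23,37.88\n")
    tgz_path = os.path.join(workdir, "sample.tgz")
    with tarfile.open(tgz_path, "w:gz") as archive:
        archive.add(csv_path, arcname="sample.csv")

    # Extract it again; the with-block closes the archive even if extraction fails.
    out_dir = os.path.join(workdir, "extracted")
    os.makedirs(out_dir, exist_ok=True)
    with tarfile.open(tgz_path) as archive:
        archive.extractall(path=out_dir)
    extracted_files = os.listdir(out_dir)

print(extracted_files)
```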



# <a id="7">Reading File paths in Python To connect to Projects</a>

In [None]:
# Print the current working directory
# (os was imported in an earlier cell)
print('default', os.getcwd())

pd.set_option('display.max_rows', 200)

pd.options.mode.chained_assignment = None

path = '../data/analytic_data2019.csv'  # the CSV file lives in the data folder, so it is easy to reference from here
df=pd.read_csv(path, na_values='NA' , skiprows=1)
df



# <a id="7">Read a large data file (.CSV) by chunk into a Pandas Data Frame. Here chunkSize = 10000</a>

+ Sometimes we have really massive files, anywhere from 2 GB to 100 GB, and we would like to read them piece by piece or just sample the data

In [63]:
import os
import numpy as np
import pandas as pd

'''

Read a large data file (.CSV) by chunk into a Pandas Data Frame. Here chunkSize = 10000

'''

path='../data/real_estate_db.csv'
reader = pd.read_csv(path, encoding = 'latin1', iterator=True)
loop = True
chunkSize = 10000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        loop = False
        print ("Iteration is stopped.")

df = pd.concat(chunks, ignore_index=True)
df.head()        

Iteration is stopped.


Unnamed: 0,UID,BLOCKID,SUMLEVEL,COUNTYID,STATEID,state,state_ab,city,place,type,...,female_age_mean,female_age_median,female_age_stdev,female_age_sample_weight,female_age_samples,pct_own,married,married_snp,separated,divorced
0,220336,,140,16,2,Alaska,AK,Unalaska,Unalaska City,City,...,32.78177,31.91667,19.31875,440.46429,1894.0,0.25053,0.47388,0.30134,0.03443,0.09802
1,220342,,140,20,2,Alaska,AK,Eagle River,Anchorage,City,...,38.97956,39.66667,20.05513,466.65478,1947.0,0.94989,0.52381,0.01777,0.00782,0.13575
2,220343,,140,20,2,Alaska,AK,Jber,Anchorage,City,...,22.20427,23.16667,13.86575,887.67805,3570.0,0.00759,0.50459,0.06676,0.01,0.01838
3,220345,,140,20,2,Alaska,AK,Anchorage,Point Mackenzie,City,...,37.0075,34.0,22.06347,281.4942,1049.0,0.20247,0.44428,0.05933,0.0,0.21563
4,220347,,140,20,2,Alaska,AK,Anchorage,Anchorage,City,...,34.96611,31.75,20.49887,655.98066,2905.0,0.56936,0.51034,0.08315,0.06731,0.08711
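
An alternative to `iterator=True` with `get_chunk` is to pass `chunksize` directly and loop over the result. The sketch below uses made-up data and a tiny chunk size purely for illustration. It also aggregates each chunk as it goes instead of concatenating, which is what keeps memory use low on a genuinely large file.

```python
import io
import pandas as pd

# Made-up data standing in for a large file.
raw = "state,pct_own\n" + "\n".join(f"S{i % 3},{i / 10}" for i in range(10))

total_rows = 0
running_sum = 0.0
# chunksize makes read_csv return an iterator of DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    total_rows += len(chunk)
    running_sum += chunk["pct_own"].sum()

# Mean of pct_own computed without ever holding the full table in memory.
print(total_rows, running_sum / total_rows)
```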


# JSON Data

In [19]:
import json

In [25]:
dataframe = """
{"name": "Joshua",
 "places_lived": ["United Kingdom","Spain", "Germany","Italy","Indonesia","Philipines"],
 "pet": null,
 "siblings": [{"name": "Caleb", "age": 28, "pets": ["Zeus", "Beiber"]},
              {"name": "Abi", "age": 20,
               "pets": ["Minu", "Lilly", "Da-rocha"]}]
}
"""

In [26]:
result = json.loads(dataframe)
result

{'name': 'Joshua',
 'places_lived': ['United Kingdom',
  'Spain',
  'Germany',
  'Italy',
  'Indonesia',
  'Philipines'],
 'pet': None,
 'siblings': [{'name': 'Caleb', 'age': 28, 'pets': ['Zeus', 'Beiber']},
  {'name': 'Abi', 'age': 20, 'pets': ['Minu', 'Lilly', 'Da-rocha']}]}

In [27]:
tojson = json.dumps(result)

In [28]:
my_siblings = pd.DataFrame(result["siblings"] , 
                           columns=["name","age","pets"])
my_siblings

Unnamed: 0,name,age,pets
0,Caleb,28,"[Zeus, Beiber]"
1,Abi,20,"[Minu, Lilly, Da-rocha]"
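
The `pets` column still holds Python lists. If one row per pet is more convenient, `DataFrame.explode` is one option. This is a sketch that rebuilds the same siblings data inline so it runs on its own.

```python
import pandas as pd

siblings = [{"name": "Caleb", "age": 28, "pets": ["Zeus", "Beiber"]},
            {"name": "Abi", "age": 20, "pets": ["Minu", "Lilly", "Da-rocha"]}]

my_siblings = pd.DataFrame(siblings, columns=["name", "age", "pets"])

# explode() repeats the other columns for every element of the list in "pets".
pets_long = my_siblings.explode("pets").reset_index(drop=True)
print(pets_long)
```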


# XML and HTML

In [29]:
dataframe = pd.read_html("../data/fdic_failed_bank_list.html")
failed_banks = dataframe[0]
failed_banks.head()

Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Allied Bank,Mulberry,AR,91,Today's Bank,"September 23, 2016","November 17, 2016"
1,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,"August 19, 2016","November 17, 2016"
2,First CornerStone Bank,King of Prussia,PA,35312,First-Citizens Bank & Trust Company,"May 6, 2016","September 6, 2016"
3,Trust Company Bank,Memphis,TN,9956,The Bank of Fayette County,"April 29, 2016","September 6, 2016"
4,North Milwaukee State Bank,Milwaukee,WI,20364,First-Citizens Bank & Trust Company,"March 11, 2016","June 16, 2016"
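
The cell above covers the HTML half of this section. For XML, one option that needs only the standard library is `xml.etree.ElementTree`. The sketch below uses made-up XML that reuses two rows from the table above; the tag names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Made-up XML carrying two of the rows shown above; the tag names are hypothetical.
xml_text = """
<banks>
  <bank><name>Allied Bank</name><city>Mulberry</city><st>AR</st></bank>
  <bank><name>Trust Company Bank</name><city>Memphis</city><st>TN</st></bank>
</banks>
""".strip()

root = ET.fromstring(xml_text)
# Turn each <bank> element into a dict mapping child tag -> text.
records = [{child.tag: child.text for child in bank} for bank in root.findall("bank")]
banks_xml = pd.DataFrame(records)
print(banks_xml)
```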


In [None]:
# Let's make a copy of the Dataframe

In [30]:
df2 = failed_banks.copy()

# Converting the date column to datetime objects
+ We will go into detail in our datetime lesson

In [31]:
timestamp = pd.to_datetime(failed_banks["Closing Date"])
timestamp

0     2016-09-23
1     2016-08-19
2     2016-05-06
3     2016-04-29
4     2016-03-11
         ...    
542   2001-07-27
543   2001-05-03
544   2001-02-02
545   2000-12-14
546   2000-10-13
Name: Closing Date, Length: 547, dtype: datetime64[ns]
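
When the layout of the dates is known, it can be spelled out explicitly. The sketch below is an illustration on a made-up three-element Series: it passes a `format` string matching dates like "September 23, 2016" and uses `errors="coerce"` so that bad values become `NaT` instead of raising an error.

```python
import pandas as pd

# Two dates in the same style as the Closing Date column, plus one deliberately bad value.
closing = pd.Series(["September 23, 2016", "August 19, 2016", "not a date"])

# "%B" is the full month name; errors="coerce" turns anything unparseable into NaT.
parsed = pd.to_datetime(closing, format="%B %d, %Y", errors="coerce")
print(parsed)
```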

# APIs

+ We will discuss APIs in more detail in the Data Engineering session

In [32]:
import requests

In [33]:
link = "https://api.github.com/repos/pandas-dev/pandas/issues"
response = requests.get(link)
response   # 200 means success; 404 means the resource was not found

<Response [200]>

# Let's parse the response as JSON

In [34]:
data = response.json()

In [39]:
data[2]["user"]

{'login': 'ivanovmg',
 'id': 41443370,
 'node_id': 'MDQ6VXNlcjQxNDQzMzcw',
 'avatar_url': 'https://avatars3.githubusercontent.com/u/41443370?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/ivanovmg',
 'html_url': 'https://github.com/ivanovmg',
 'followers_url': 'https://api.github.com/users/ivanovmg/followers',
 'following_url': 'https://api.github.com/users/ivanovmg/following{/other_user}',
 'gists_url': 'https://api.github.com/users/ivanovmg/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/ivanovmg/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/ivanovmg/subscriptions',
 'organizations_url': 'https://api.github.com/users/ivanovmg/orgs',
 'repos_url': 'https://api.github.com/users/ivanovmg/repos',
 'events_url': 'https://api.github.com/users/ivanovmg/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/ivanovmg/received_events',
 'type': 'User',
 'site_admin': False}

# Converting the JSON data into a pandas DataFrame

In [40]:
new_data = pd.DataFrame(data, columns=["number","title","labels","state"])
new_data.head()

Unnamed: 0,number,title,labels,state
0,37742,"BUG: rolling.apply() with engine=""numba"" cause...","[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open
1,37741,Backport PR #37661 on branch 1.1.x: BUG: Rolli...,"[{'id': 233160, 'node_id': 'MDU6TGFiZWwyMzMxNj...",open
2,37739,TYP: fix mypy ignored err in pandas/io/formats...,"[{'id': 1280988427, 'node_id': 'MDU6TGFiZWwxMj...",open
3,37738,TYP: fix mypy ignored error in pandas/io/forma...,"[{'id': 1280988427, 'node_id': 'MDU6TGFiZWwxMj...",open
4,37737,BUG: Output all columns even if subsetting col...,"[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open


In [None]:
# Take a quick look at the label names using a nested loop:
# the outer loop walks over the issues, the inner loop over each issue's labels
for labels in new_data["labels"]:
    for lab in labels:
        print(lab["name"])
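
Each entry in the `labels` column is a list of dictionaries, which is awkward to read in a table. One way to tidy it is to keep only the label names. The sketch below uses made-up records that mimic the shape of the issues payload, not real API output, so it runs without a network connection.

```python
import pandas as pd

# Made-up records with the same nesting as the issues data (hypothetical values).
issues = [
    {"number": 1, "title": "Example bug", "state": "open",
     "labels": [{"id": 10, "name": "Bug"}, {"id": 11, "name": "IO CSV"}]},
    {"number": 2, "title": "Example typing fix", "state": "open",
     "labels": [{"id": 12, "name": "Typing"}]},
]

issues_df = pd.DataFrame(issues, columns=["number", "title", "labels", "state"])
# Replace each list of label dicts with a plain list of label names.
issues_df["label_names"] = issues_df["labels"].apply(
    lambda labels: [label["name"] for label in labels]
)
print(issues_df[["number", "label_names"]])
```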

# Interacting with Databases

In [43]:
import sqlite3
# Connect to the database (the file is created if it does not exist).
con = sqlite3.connect('mydata2.sqlite')
# Drop the table if it already exists, so this cell can be re-run safely.
con.execute("drop table if exists test")
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""
con.execute(query)
con.commit()

In [44]:
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()

In [45]:
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

## Convert the above into a pandas dataframe

In [46]:
cursor.description  # one tuple per column; the first element of each tuple is the column name
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
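
pandas can also run the query and build the DataFrame in one step with `pd.read_sql_query`. The sketch below recreates the same table in an in-memory database so it is self-contained. The in-memory connection is an assumption for illustration; the lesson uses a file on disk.

```python
import sqlite3
import pandas as pd

# An in-memory database keeps the sketch self-contained.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con.executemany("INSERT INTO test VALUES(?, ?, ?, ?)",
                [('Atlanta', 'Georgia', 1.25, 6),
                 ('Tallahassee', 'Florida', 2.6, 3),
                 ('Sacramento', 'California', 1.7, 5)])
con.commit()

# read_sql_query runs the query and builds the DataFrame, column names included.
frame = pd.read_sql_query("select * from test", con)
con.close()
print(frame)
```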


In [47]:
# Connect to the database.
db = sqlite3.connect("my_database5.db")
# Drop the table if it already exists, using the execute() method.
db.execute("drop table if exists grades1")
# Create table as per requirement
db.execute("create table grades1(id int, name text, score int)")
#inserting values inside the created table
db.execute("insert into grades1(id, name, score) values(101, 'John',99 )")
db.execute("insert into grades1(id, name, score) values(102, 'Gary',90 )")
db.execute("insert into grades1(id, name, score) values(103, 'James', 80 )")
db.execute("insert into grades1(id, name, score) values(104, 'Cathy', 85 )")
db.execute("insert into grades1(id, name, score) values(105, 'Kris',95 )")

<sqlite3.Cursor at 0x11ef07d50>

In [48]:
db.commit()

In [50]:
results = db.execute("Select * from grades1 order by id")
for x in results:
    print(x)
print("-" * 60)    

(101, 'John', 99)
(102, 'Gary', 90)
(103, 'James', 80)
(104, 'Cathy', 85)
(105, 'Kris', 95)
------------------------------------------------------------


In [51]:
results = db.execute("select * from grades1 where name = 'Gary' ")
for row in results: print(row)
print("-"* 60 )

(102, 'Gary', 90)
------------------------------------------------------------
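
When the value in a `where` clause comes from a variable, it is safer to let `sqlite3` bind it through a `?` placeholder than to build the SQL string by hand. Here is a minimal sketch, using an in-memory database and a shortened copy of the grades table for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # in-memory database, used here only for illustration
db.execute("create table grades1(id int, name text, score int)")
db.executemany("insert into grades1(id, name, score) values(?, ?, ?)",
               [(101, 'John', 99), (102, 'Gary', 90), (103, 'James', 80)])
db.commit()

wanted = "Gary"  # imagine this value came from user input
# The ? placeholder lets sqlite3 bind the value instead of pasting it into the SQL text.
rows = db.execute("select * from grades1 where name = ?", (wanted,)).fetchall()
db.close()
print(rows)
```

The same placeholder style is what `executemany` uses earlier in this lesson for bulk inserts.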


In [52]:
results = db.execute("select * from grades1 where score >= 90 ")
for row in results:
    print(row)
print("-" * 60 )

(101, 'John', 99)
(102, 'Gary', 90)
(105, 'Kris', 95)
------------------------------------------------------------


In [53]:
results = db.execute("select name, score from grades1 order by score desc ")
for row in results:
    print(row)
print("-" * 60 )

('John', 99)
('Kris', 95)
('Gary', 90)
('Cathy', 85)
('James', 80)
------------------------------------------------------------


In [54]:
results = db.execute("select name, score from grades1 order by score")
for row in results:
    print(row)
print("-" * 60 )

('James', 80)
('Cathy', 85)
('Gary', 90)
('Kris', 95)
('John', 99)
------------------------------------------------------------



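# Summary:
+ In this session we had a lot of fun exploring various ways of importing data into pandas: Excel, CSV and text files, URLs, large files read in chunks, JSON, HTML, APIs, and SQLite databases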