## Importing from more data sources
Data doesn't always come as tidy CSV or Excel files; it can come from many different sources.
Sometimes we might even have to scrape it from websites using web crawlers.


## Introduction

In this lab, you'll get some practice with loading files with summary or metadata, and if you find that easy, the optional "level up" content covers loading data from a corrupted csv file!

## Objectives
You will be able to:
- Import data from csv files and Excel files
- Import data from a URL, HTML files, databases, and more

##  We will also be looking at loading Files with Summary or Meta Data

Load either of the files `'Zipcode_Demos.csv'` or `'Zipcode_Demos.xlsx'`. What's going on with this dataset? Clean it up into a useable format and describe the nuances of how the data is currently formatted.

All data files are stored in a folder titled `'data'`

#### In the previous session we saw how to import other files as below

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [None]:
# data = pd.read_excel("../data/Transactions_Zenapt.xlsx", sheet_name="Expense")
# data.head(3)

# <a id="7">Opening Excel Files In Jupyter Notebook</a>

In [3]:
pd.options.mode.chained_assignment = None

path='../data/FraudDetection.xlsx'
pdexcel = pd.ExcelFile(path)
print (pdexcel.sheet_names)
df = pdexcel.parse("Data") 
print (df.shape)
df.head()

['Data']
(16281, 15)


Unnamed: 0,Transaction_ID,Card Tenure_months,WebsiteRegion,Trans_value,Seller_Category,Items_transaction,Shipping_Address,Purchase_Category,Othercard_owner,Seller_way,LastTransaction,Lastflagedvalue,LastMonthsTrans_Freq,Countryissuedcard,Fraud_Detected
0,1,25,US,256.41,Store + online,10,US,online services,Dependent,Yes,0,0,40,United-States,No
1,2,38,US,409.54,Online only,8,International,Accessories,Husband,Yes,0,0,50,United-States,No
2,3,28,EU,293.09,Online only,5,International,Electornics,Husband,Yes,0,0,40,United-States,Yes
3,4,44,US,444.07,Online only,7,International,online services,Husband,Yes,7688,0,40,United-States,Yes
4,5,18,EU,183.11,Online only,7,US,Food,Dependent,No,0,0,30,United-States,No


# <a id="7">Credit Card Default Data from the UCI Machine Learning Repository</a>

In [6]:
path='../data/defaultofcreditcardclients(UCI).xls'
pdexcel = pd.ExcelFile(path)
print (pdexcel.sheet_names)
credit_df = pdexcel.parse("Data", skiprows=1)  # skiprows=1 skips the extra header row that sits above the real column names
print (credit_df.shape)
credit_df.head()

['Data']
(30000, 25)


Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0



# <a id="7">Importing from URL</a>

In [8]:
url = "http://www.basketball-reference.com/leagues/NBA_2015_totals.html"
data = pd.read_html(url)  # read from the URL
data[0].iloc[:, 0:20].head()  # first 5 rows of the first table, limited to the first 20 columns

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA
0,1,Quincy Acy,PF,24,NYK,68,22,1287,152,331,0.459,18,60,0.3,134,271,0.494,0.486,76,97
1,2,Jordan Adams,SG,20,MEM,30,0,248,35,86,0.407,10,25,0.4,25,61,0.41,0.465,14,23
2,3,Steven Adams,C,21,OKC,70,67,1771,217,399,0.544,0,2,0.0,217,397,0.547,0.544,103,205
3,4,Jeff Adrien,PF,28,MIN,17,0,215,19,44,0.432,0,0,,19,44,0.432,0.432,22,38
4,5,Arron Afflalo,SG,29,TOT,78,72,2502,375,884,0.424,118,333,0.354,257,551,0.466,0.491,167,198



# <a id="7">Importing the titanic data from a Github url</a>

In [5]:
titanic = pd.read_csv("https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv",
                         sep='\t')
titanic.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


### Let's look at each passenger's name and cabin class

In [13]:
titanic[["Name","Pclass"]].head()

Unnamed: 0,Name,Pclass
0,"Braund, Mr. Owen Harris",3
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1
2,"Heikkinen, Miss. Laina",3
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1
4,"Allen, Mr. William Henry",3


In [6]:
# NAME , SEX , FARE
titanic[["Name","Sex","Fare"]].head()

Unnamed: 0,Name,Sex,Fare
0,"Braund, Mr. Owen Harris",male,7.25
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,71.2833
2,"Heikkinen, Miss. Laina",female,7.925
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,53.1
4,"Allen, Mr. William Henry",male,8.05


In [14]:
import os
import sys

In [17]:
dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)
ts.to_csv('../data/newtseries.csv')
!cat ../data/newtseries.csv

,0
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
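
Reading the saved series back is a useful round-trip check. The sketch below rebuilds the same series and writes it to an in-memory buffer (an assumption made so the example runs without the `../data` folder), then reads it back with the dates restored as the index.

```python
import io

import numpy as np
import pandas as pd

# Rebuild the same series as above and write it to an in-memory buffer
# (a stand-in for '../data/newtseries.csv').
dates = pd.date_range("1/1/2000", periods=7)
ts = pd.Series(np.arange(7), index=dates)
buffer = io.StringIO()
ts.to_csv(buffer)
buffer.seek(0)

# index_col=0 uses the first column as the index;
# parse_dates=True converts that index to datetimes.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(restored.index.dtype)
print(restored.head(3))
```

Without `index_col=0` the dates would come back as an ordinary string column rather than as the index.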


# Read the CSV file ‘demo.csv’ into a data frame ‘demo’.

In [66]:
pd.options.mode.chained_assignment = None

path='../data/demo.csv'
demo=pd.read_csv(path, sep=",", na_values='.')
demo.head(4)

Unnamed: 0,sex,age,siblings,color,income
0,female,27.6,1,red,6611.0
1,female,49.3,2,green,7652.0
2,male,16.8,3,red,1778.0
3,male,14.4,0,red,567.0


# <a id="7">Getting data  from txt files</a>

+ Read the file ‘survey.txt’ from the data folder into a data frame ‘surveydf’.
+ Use the first line as the column names and treat a dot ‘.’ in the file as a missing value.


In [10]:
surveydf=pd.read_table('../data/survey.txt')
surveydf.head()

Unnamed: 0,age worktype education marrital relationship gender capitalgain workhours income
0,25 Private 11th Never-married Own-child Male 0...
1,"38 Private . Married-civ-spouse Husband , 0 50..."
2,28 Local-gov Assoc-acdm Married-civ-spouse Hus...
3,44 Private Some-college Married-civ-spouse Hus...
4,34 Private 10th Never-married Not-in-family Ma...


In [68]:
surveydf=pd.read_table('../data/survey.txt', sep=' ', header=0, na_values='.')
surveydf.head()

Unnamed: 0,age,worktype,education,marrital,relationship,gender,capitalgain,workhours,income
0,25,Private,11th,Never-married,Own-child,Male,0.0,40.0,<=50K
1,38,Private,,Married-civ-spouse,Husband,",",0.0,50.0,<=50K
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Husband,Male,0.0,40.0,>50K
3,44,Private,Some-college,Married-civ-spouse,Husband,,7688.0,40.0,>50K
4,34,Private,10th,Never-married,Not-in-family,Male,0.0,30.0,<=50K


# <a id="7">Getting the cardio data into shape</a>

In [11]:
import os
print ('default',os.getcwd())  # this shows my current working directory

default /Users/flatironschool/Desktop/iNueron/Introduction-to-Python/Part 6 - Pandas/01.Importing Data in Pandas


In [121]:
dataframe = pd.read_csv("../data/cardio_train.csv")
dataframe.head()

Unnamed: 0,id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio
0,988;22469;1;155;69.0;130;80;2;2;0;0;1;0
1,989;14648;1;163;71.0;110;70;1;1;0;0;1;1
2,990;21901;1;165;70.0;120;80;1;1;0;0;1;0
3,991;14549;2;165;85.0;120;80;1;1;1;1;1;0
4,992;23393;1;155;62.0;120;80;1;1;0;0;1;0


# Observation:
+ The data above was not split into separate columns because the file uses ";" as its delimiter. Setting `sep=";"` fixes this.
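
When the delimiter is not known in advance, the standard library can guess it from a small sample. This is a minimal sketch using `csv.Sniffer`; the sample string is copied from the preview above, and the list of candidate delimiters is an assumption chosen for illustration.

```python
import csv

# A small sample in the same shape as cardio_train.csv
# (values copied from the preview above, truncated to five columns).
sample = (
    "id;age;gender;height;weight\n"
    "988;22469;1;155;69.0\n"
    "989;14648;1;163;71.0\n"
)

# Sniffer guesses the dialect from a text sample; restricting the
# candidate delimiters keeps the guess stable.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")
print(repr(dialect.delimiter))
```

The detected delimiter can then be passed on as `sep=dialect.delimiter` in `pd.read_csv`.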

In [122]:
dataframe = pd.read_csv("../data/cardio_train.csv" , na_values="NA" , sep=";")
dataframe.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,988,22469,1,155,69.0,130,80,2,2,0,0,1,0
1,989,14648,1,163,71.0,110,70,1,1,0,0,1,1
2,990,21901,1,165,70.0,120,80,1,1,0,0,1,0
3,991,14549,2,165,85.0,120,80,1,1,1,1,1,0
4,992,23393,1,155,62.0,120,80,1,1,0,0,1,0



# <a id="7">Getting the titanic data</a>

In [None]:
import os
TITANIC_PATH = os.path.join("datasets","titanic")

In [None]:
def load_titanic_data(filename, titanic_path=TITANIC_PATH):
    csv_path = os.path.join(titanic_path, filename)  # os.path.join builds the path; os.join does not exist
    return pd.read_csv(csv_path)

In [None]:
train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")


# <a id="7">Getting the California housing Data from Github</a>

In [56]:
import tarfile
import urllib.request  # import the submodule explicitly so urllib.request.urlretrieve is available

In [57]:
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets","housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_the_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")  # local path where the downloaded archive is saved
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

In [58]:
fetch_the_data()

In [59]:
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [60]:
data = load_housing_data()
data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY



# <a id="7">Reading File paths in Python To connect to Projects</a>

In [None]:
# This gives you the location of the directory
# import os
print ('default',os.getcwd())

pd.set_option('display.max_rows', 200)

pd.options.mode.chained_assignment = None

path = '../data/analytic_data2019.csv'  # the csv file is kept in the data folder so it is easy to load from there
df=pd.read_csv(path, na_values='NA' , skiprows=1)
df



# <a id="7">Read a large data file (.CSV) by chunk into a Pandas Data Frame. Here chunkSize = 10000</a>

+ Sometimes we have really massive files, from 2 GB to 100 GB, and we would like to read them piece by piece or just sample the data

In [63]:
import os , pandas , numpy

# Read a large data file (.CSV) by chunk into a Pandas Data Frame. Here chunkSize = 10000

path='../data/real_estate_db.csv'
reader = pd.read_csv(path, encoding = 'latin1', iterator=True)
loop = True
chunkSize = 10000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        loop = False
        print ("Iteration is stopped.")

df = pd.concat(chunks, ignore_index=True)
df.head()        

Iteration is stopped.


Unnamed: 0,UID,BLOCKID,SUMLEVEL,COUNTYID,STATEID,state,state_ab,city,place,type,...,female_age_mean,female_age_median,female_age_stdev,female_age_sample_weight,female_age_samples,pct_own,married,married_snp,separated,divorced
0,220336,,140,16,2,Alaska,AK,Unalaska,Unalaska City,City,...,32.78177,31.91667,19.31875,440.46429,1894.0,0.25053,0.47388,0.30134,0.03443,0.09802
1,220342,,140,20,2,Alaska,AK,Eagle River,Anchorage,City,...,38.97956,39.66667,20.05513,466.65478,1947.0,0.94989,0.52381,0.01777,0.00782,0.13575
2,220343,,140,20,2,Alaska,AK,Jber,Anchorage,City,...,22.20427,23.16667,13.86575,887.67805,3570.0,0.00759,0.50459,0.06676,0.01,0.01838
3,220345,,140,20,2,Alaska,AK,Anchorage,Point Mackenzie,City,...,37.0075,34.0,22.06347,281.4942,1049.0,0.20247,0.44428,0.05933,0.0,0.21563
4,220347,,140,20,2,Alaska,AK,Anchorage,Anchorage,City,...,34.96611,31.75,20.49887,655.98066,2905.0,0.56936,0.51034,0.08315,0.06731,0.08711
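
The same chunked read can also be written with the `chunksize` argument, which makes `read_csv` return an iterator of DataFrames. The sketch below uses a tiny in-memory CSV as a hypothetical stand-in for a large file, so the column names and sizes are illustrative only.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large file: 25 rows held in memory.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(25))

# chunksize makes read_csv return an iterator of DataFrames
# instead of a single frame.
chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    chunks.append(chunk)  # filter or aggregate each chunk here to keep memory use low

df_small = pd.concat(chunks, ignore_index=True)
print(len(chunks), df_small.shape)
```

For a genuinely large file, it is usually better to reduce each chunk (filter rows, keep a few columns, or sample) before appending it, since concatenating every chunk rebuilds the full table in memory.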


# JSON Data

In [19]:
import json

In [25]:
dataframe = """
{"name": "Joshua",
 "places_lived": ["United Kingdom","Spain", "Germany","Italy","Indonesia","Philipines"],
 "pet": null,
 "siblings": [{"name": "Caleb", "age": 28, "pets": ["Zeus", "Beiber"]},
              {"name": "Abi", "age": 20,
               "pets": ["Minu", "Lilly", "Da-rocha"]}]
}
"""

In [26]:
result = json.loads(dataframe)
result

{'name': 'Joshua',
 'places_lived': ['United Kingdom',
  'Spain',
  'Germany',
  'Italy',
  'Indonesia',
  'Philipines'],
 'pet': None,
 'siblings': [{'name': 'Caleb', 'age': 28, 'pets': ['Zeus', 'Beiber']},
  {'name': 'Abi', 'age': 20, 'pets': ['Minu', 'Lilly', 'Da-rocha']}]}

In [27]:
tojson = json.dumps(result)

In [28]:
my_siblings = pd.DataFrame(result["siblings"] , 
                           columns=["name","age","pets"])
my_siblings

Unnamed: 0,name,age,pets
0,Caleb,28,"[Zeus, Beiber]"
1,Abi,20,"[Minu, Lilly, Da-rocha]"
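
The `pets` column above still holds Python lists. One way to get one row per pet is `DataFrame.explode`; the sketch below repeats only the `siblings` part of the record so that it runs on its own.

```python
import json

import pandas as pd

# Only the "siblings" part of the record above, repeated here so the example is self-contained.
raw = """
{"siblings": [{"name": "Caleb", "age": 28, "pets": ["Zeus", "Beiber"]},
              {"name": "Abi", "age": 20, "pets": ["Minu", "Lilly", "Da-rocha"]}]}
"""
record = json.loads(raw)

siblings = pd.DataFrame(record["siblings"])
# explode() repeats name and age for every element of the "pets" list.
pets_long = siblings.explode("pets").reset_index(drop=True)
print(pets_long)
```

The long format is easier to filter and group than a column of lists.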


# XML and HTML

In [29]:
dataframe = pd.read_html("../data/fdic_failed_bank_list.html")
failed_banks = dataframe[0]
failed_banks.head()

Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Allied Bank,Mulberry,AR,91,Today's Bank,"September 23, 2016","November 17, 2016"
1,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,"August 19, 2016","November 17, 2016"
2,First CornerStone Bank,King of Prussia,PA,35312,First-Citizens Bank & Trust Company,"May 6, 2016","September 6, 2016"
3,Trust Company Bank,Memphis,TN,9956,The Bank of Fayette County,"April 29, 2016","September 6, 2016"
4,North Milwaukee State Bank,Milwaukee,WI,20364,First-Citizens Bank & Trust Company,"March 11, 2016","June 16, 2016"


In [None]:
# Let's make a copy of the Dataframe

In [30]:
df2 = failed_banks.copy()

# Converting the date column to datetime objects
+ We will go into detail in our datetime lesson

In [31]:
timestamp = pd.to_datetime(failed_banks["Closing Date"])
timestamp

0     2016-09-23
1     2016-08-19
2     2016-05-06
3     2016-04-29
4     2016-03-11
         ...    
542   2001-07-27
543   2001-05-03
544   2001-02-02
545   2000-12-14
546   2000-10-13
Name: Closing Date, Length: 547, dtype: datetime64[ns]
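
When the date layout is known, passing an explicit `format` makes the conversion stricter, and `errors="coerce"` turns unparseable values into `NaT` instead of raising. In this sketch the first two strings come from the table above; the third is a made-up bad value added for illustration.

```python
import pandas as pd

# Two real values from the table above plus one hypothetical bad entry.
closing = pd.Series(["September 23, 2016", "August 19, 2016", "not a date"])

# %B = full month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(closing, format="%B %d, %Y", errors="coerce")
print(parsed)
print(parsed.dt.year.value_counts())
```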

# APIs

+ We will discuss APIs in more detail in the Data Engineering session

In [32]:
import requests

In [33]:
link = "https://api.github.com/repos/pandas-dev/pandas/issues"
response = requests.get(link)
response   # 200 means the request succeeded; 404 means the resource was not found

<Response [200]>

# Let's parse the response as JSON

In [34]:
data = response.json()

In [39]:
data[2]["user"]

{'login': 'ivanovmg',
 'id': 41443370,
 'node_id': 'MDQ6VXNlcjQxNDQzMzcw',
 'avatar_url': 'https://avatars3.githubusercontent.com/u/41443370?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/ivanovmg',
 'html_url': 'https://github.com/ivanovmg',
 'followers_url': 'https://api.github.com/users/ivanovmg/followers',
 'following_url': 'https://api.github.com/users/ivanovmg/following{/other_user}',
 'gists_url': 'https://api.github.com/users/ivanovmg/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/ivanovmg/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/ivanovmg/subscriptions',
 'organizations_url': 'https://api.github.com/users/ivanovmg/orgs',
 'repos_url': 'https://api.github.com/users/ivanovmg/repos',
 'events_url': 'https://api.github.com/users/ivanovmg/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/ivanovmg/received_events',
 'type': 'User',
 'site_admin': False}

# Converting the json format into pandas dataframe

In [40]:
new_data = pd.DataFrame(data, columns=["number","title","labels","state"])
new_data.head()

Unnamed: 0,number,title,labels,state
0,37742,"BUG: rolling.apply() with engine=""numba"" cause...","[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open
1,37741,Backport PR #37661 on branch 1.1.x: BUG: Rolli...,"[{'id': 233160, 'node_id': 'MDU6TGFiZWwyMzMxNj...",open
2,37739,TYP: fix mypy ignored err in pandas/io/formats...,"[{'id': 1280988427, 'node_id': 'MDU6TGFiZWwxMj...",open
3,37738,TYP: fix mypy ignored error in pandas/io/forma...,"[{'id': 1280988427, 'node_id': 'MDU6TGFiZWwxMj...",open
4,37737,BUG: Output all columns even if subsetting col...,"[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open


In [None]:
# Taking a quick look at the labels using a nested loop
for labels in new_data["labels"]:  # one list of label dicts per issue
    for lab in labels:             # each label attached to that issue
        print(lab)

# Interacting with Databases

In [43]:
import sqlite3
# Connect to the database (the file is created if it does not exist).
con = sqlite3.connect('mydata2.sqlite')
# Drop the table if it already exists so that this cell can be re-run.
con.execute("drop table if exists test")
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""
con.execute(query)
con.commit()

In [44]:
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()

In [45]:
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

## Convert the above into a pandas dataframe

In [46]:
cursor.description
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
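
pandas can also run the query and build the DataFrame in one step with `pd.read_sql_query`. The sketch below uses an in-memory SQLite database and a separate connection name, `con_demo` (both assumptions, so the example does not touch the files created above).

```python
import sqlite3

import pandas as pd

# In-memory database so the example leaves no file behind.
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)"
)
con_demo.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?)",
    [
        ("Atlanta", "Georgia", 1.25, 6),
        ("Tallahassee", "Florida", 2.6, 3),
        ("Sacramento", "California", 1.7, 5),
    ],
)
con_demo.commit()

# read_sql_query runs the SELECT and uses the column names from the result set.
frame = pd.read_sql_query("SELECT * FROM test", con_demo)
con_demo.close()
print(frame)
```

This avoids building the column list from `cursor.description` by hand.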


In [47]:
#connecting with the database.
db = sqlite3.connect("my_database5.db")
# Drop the table if it already exists, using the execute() method.
db.execute("drop table if exists grades1")
# Create table as per requirement
db.execute("create table grades1(id int, name text, score int)")
#inserting values inside the created table
db.execute("insert into grades1(id, name, score) values(101, 'John',99 )")
db.execute("insert into grades1(id, name, score) values(102, 'Gary',90 )")
db.execute("insert into grades1(id, name, score) values(103, 'James', 80 )")
db.execute("insert into grades1(id, name, score) values(104, 'Cathy', 85 )")
db.execute("insert into grades1(id, name, score) values(105, 'Kris',95 )")

<sqlite3.Cursor at 0x11ef07d50>

In [48]:
db.commit()

In [50]:
results = db.execute("Select * from grades1 order by id")
for x in results:
    print(x)
print("-" * 60)    

(101, 'John', 99)
(102, 'Gary', 90)
(103, 'James', 80)
(104, 'Cathy', 85)
(105, 'Kris', 95)
------------------------------------------------------------


In [51]:
results = db.execute("select * from grades1 where name = 'Gary' ")
for row in results: print(row)
print("-"* 60 )

(102, 'Gary', 90)
------------------------------------------------------------


In [52]:
results = db.execute("select * from grades1 where score >= 90 ")
for row in results:
    print(row)
print("-" * 60 )

(101, 'John', 99)
(102, 'Gary', 90)
(105, 'Kris', 95)
------------------------------------------------------------


In [53]:
results = db.execute("select name, score from grades1 order by score desc ")
for row in results:
    print(row)
print("-" * 60 )

('John', 99)
('Kris', 95)
('Gary', 90)
('Cathy', 85)
('James', 80)
------------------------------------------------------------


In [54]:
results = db.execute("select name, score from grades1 order by score")
for row in results:
    print(row)
print("-" * 60 )

('James', 80)
('Cathy', 85)
('Gary', 90)
('Kris', 95)
('John', 99)
------------------------------------------------------------


In [55]:
results = db.execute("select name, score from grades1 order by score")
for row in results:
    print(row)

('James', 80)
('Cathy', 85)
('Gary', 90)
('Kris', 95)
('John', 99)


# Connecting to the Weather data from a URL using OOP and Converting it into a Database using HDF5

In [1]:
import pandas as pd
import numpy as np

import io
from io import StringIO

import requests
import urllib

import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import colors
from matplotlib.ticker import PercentFormatter

In [8]:
class IngestData:
    def __init__(self , dataset_url):
        self.url = dataset_url
        
    def get_data(self):
        response = requests.get(self.url)
        result = str(response.content, "utf-8")
        data = StringIO(result)
        return data
    
    def create_dataframe(self , data , n_rows_skip=5):
        df = pd.read_table(data, sep=r"\s+", skiprows=n_rows_skip)  # raw string avoids an invalid-escape warning; \s+ splits on any run of whitespace
        return df

In [11]:
url = "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/datasets/Sunshine/date/UK.txt"

data = IngestData(url)
data_str = data.get_data()
weather_data = data.create_dataframe(data_str , 5)
weather_data.head()



Unnamed: 0,year,jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec,win,spr,sum,aut,ann
0,1919,40.0,56.9,98.9,121.1,210.5,187.2,156.2,183.3,135.4,111.5,48.0,32.2,---,430.4,526.7,294.9,1381.1
1,1920,43.2,68.7,103.9,95.4,176.5,197.6,123.4,127.4,105.3,101.3,50.5,36.2,144.1,375.9,448.4,257.1,1229.5
2,1921,30.2,62.6,100.1,182.4,208.6,220.8,204.9,138.3,142.5,102.9,55.8,32.9,129.0,491.1,564.0,301.1,1481.9
3,1922,38.4,65.3,97.8,159.6,209.1,182.7,138.2,122.6,101.4,108.7,53.1,32.4,136.6,466.5,443.5,263.2,1309.3
4,1923,42.5,46.8,105.6,128.9,158.5,126.7,162.5,164.8,137.5,90.2,71.9,38.0,121.8,393.0,453.9,299.5,1273.8


In [12]:
# Check the shape of the DataFrame
print('Weather Data - rows:' , weather_data.shape[0],'columns:', weather_data.shape[1])

Weather Data - rows: 102 columns: 18


In [13]:
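weather_data = weather_data.iloc[1:, :].copy()  # drop the first row, whose "win" value is the placeholder '---'; copy() avoids modifying a slice
weather_data = weather_data.dropna()  # drop all rows with missing values
weather_data["win"] = pd.to_numeric(weather_data["win"])  # convert the column to a numeric dtype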

# HDF5

In [16]:
import h5py

In [18]:
def storeInHDF5(df):
    db = pd.HDFStore('Database.h5')

    groups = ['A','B','C']     

    for m in groups:
        subgroups = ['d','e','f']

        for n in subgroups:
            db.put(m + '/' + n, df, format = 'table', data_columns = True)
    db.close()

# Storing the weather data here - Please read the note below if running the line below returns an error
#### <a id='7'>Note: If you get an error here, run this in your terminal: `pip install --user tables`</a>

#### <a id='7'>For conda users: If you get an error here, run this in your terminal: `conda install pytables`</a>

https://stackoverflow.com/questions/58479748/missing-optional-dependency-tables-in-pandas-to-hdf

In [20]:
storeInHDF5(weather_data)

# Now that we have stored the weather data in HDF5. Let's do some simple manipulation

In [21]:
database = pd.HDFStore('Database.h5')
restored_df = pd.DataFrame(database['A/d'])
database.close()
restored_df.head()

Unnamed: 0,year,jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec,win,spr,sum,aut,ann
1,1920,43.2,68.7,103.9,95.4,176.5,197.6,123.4,127.4,105.3,101.3,50.5,36.2,144.1,375.9,448.4,257.1,1229.5
2,1921,30.2,62.6,100.1,182.4,208.6,220.8,204.9,138.3,142.5,102.9,55.8,32.9,129.0,491.1,564.0,301.1,1481.9
3,1922,38.4,65.3,97.8,159.6,209.1,182.7,138.2,122.6,101.4,108.7,53.1,32.4,136.6,466.5,443.5,263.2,1309.3
4,1923,42.5,46.8,105.6,128.9,158.5,126.7,162.5,164.8,137.5,90.2,71.9,38.0,121.8,393.0,453.9,299.5,1273.8
5,1924,41.2,50.4,130.7,133.9,146.0,161.4,169.2,118.8,111.5,82.2,46.2,35.5,129.6,410.6,449.4,239.9,1227.0


##  Loading Files with Summary or Meta Data

Load either of the files `'Zipcode_Demos.csv'` or `'Zipcode_Demos.xlsx'`. What's going on with this dataset? Clean it up into a useable format and describe the nuances of how the data is currently formatted.

All data files are stored in a folder titled `'data'`.

# Import the file and print the first 5 rows

In [125]:
zipcodes = pd.read_csv("../data/Zipcode_Demos.csv")
zipcodes.head()

Unnamed: 0,0,Average Statistics,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43,Unnamed: 44,Unnamed: 45,Unnamed: 46
0,1,,0.0,,,,,,,,...,,,,,,,,,,
1,2,JURISDICTION NAME,10005.8,,,,,,,,...,,,,,,,,,,
2,3,COUNT PARTICIPANTS,9.4,,,,,,,,...,,,,,,,,,,
3,4,COUNT FEMALE,4.8,,,,,,,,...,,,,,,,,,,
4,5,PERCENT FEMALE,0.404,,,,,,,,...,,,,,,,,,,


# Print the last 5 rows of zipcodes

In [126]:
zipcodes.tail()

Unnamed: 0,0,Average Statistics,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43,Unnamed: 44,Unnamed: 45,Unnamed: 46
52,53,10006,6,2,0.33,4,0.67,0,0,6,...,6,100,0,0,6,1,0,0,6,100
53,54,10007,1,0,0.0,1,1.0,0,0,1,...,1,100,1,1,0,0,0,0,1,100
54,55,10009,2,0,0.0,2,1.0,0,0,2,...,2,100,0,0,2,1,0,0,2,100
55,56,10010,0,0,0.0,0,0.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
56,57,10011,3,2,0.67,1,0.33,0,0,3,...,3,100,0,0,3,1,0,0,3,100


# Observation:
+ What is going on with this data set? Anything unusual?
+ The beginning of the file is a summary block that is mostly null values.
+ The second part holds the actual data.

In [127]:
prev_count = 10 ** 3
for row in zipcodes.index:
    count = 0
    for element in zipcodes.iloc[row].isnull():
        if element:
            count += 1
    if count != prev_count and row!=0:
        print("On row {} there are {} null values. The previous row had {} null values."
             .format(row, count, prev_count))
    prev_count = count     

On row 1 there are 44 null values. The previous row had 45 null values.
On row 46 there are 0 null values. The previous row had 44 null values.
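
The same check can be done without explicit loops by counting nulls per row and looking for rows where the count changes. The frame below is a hypothetical miniature of the `Zipcode_Demos` layout (a summary block followed by the real table), with made-up column names.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the layout: two summary rows, then two data rows.
demo = pd.DataFrame({
    "c0": ["JURISDICTION NAME", "COUNT PARTICIPANTS", "10006", "10007"],
    "c1": [10005.8, 9.4, 6, 1],
    "c2": [np.nan, np.nan, 2, 0],
    "c3": [np.nan, np.nan, 0.33, 0.0],
})

# Number of null values in each row.
null_counts = demo.isnull().sum(axis=1)

# Rows whose null count differs from the previous row mark a change of layout.
changes = null_counts[null_counts.diff().fillna(0) != 0]
print(changes)
```

The index labels in `changes` point at the first row of each new block, which is the information needed for `skiprows` and `nrows`.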


# Selecting the specific rows that hold the summary block (the part full of null values)

In [128]:
zipcodes_2 = pd.read_csv("../data/Zipcode_Demos.csv" , skiprows=[1] , nrows=45 , 
                        usecols=[0,1,2])
zipcodes_2.head()

Unnamed: 0,0,Average Statistics,Unnamed: 2
0,2,JURISDICTION NAME,10005.8
1,3,COUNT PARTICIPANTS,9.4
2,4,COUNT FEMALE,4.8
3,5,PERCENT FEMALE,0.404
4,6,COUNT MALE,4.6


# How to Deal with Corrupted files

## Boss_level (Optional) - Loading Corrupt CSV files

Occasionally, you encounter some really ill-formatted data. One example of this can be data that has strings containing commas in a csv file. Under the standard protocol, when this occurs, one is supposed to use quotes to differentiate between the commas denoting fields and the commas within those fields themselves. For example, we could have a table like this:  

`ReviewerID,Rating,N_reviews,Review,VenueID
123456,4,137,This restaurant was pretty good, we had a great time.,98765`

Which should be saved like this if it were a csv (to avoid confusion with the commas in the Review text):
`"ReviewerID","Rating","N_reviews","Review","VenueID"
"123456","4","137","This restaurant was pretty good, we had a great time.","98765"`
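
The effect of quoting can be checked with the standard library `csv` module. This sketch writes the example row with every field quoted, reads it back, and compares the field count with a naive split on commas.

```python
import csv
import io

header = ["ReviewerID", "Rating", "N_reviews", "Review", "VenueID"]
row = ["123456", "4", "137",
       "This restaurant was pretty good, we had a great time.", "98765"]

# QUOTE_ALL wraps every field in double quotes, as in the second form above.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow(header)
writer.writerow(row)

# Reading the quoted text back recovers the original five fields.
buffer.seek(0)
parsed_rows = list(csv.reader(buffer))
print(len(parsed_rows[1]))

# Splitting the unquoted text on commas yields one field too many.
naive_fields = ",".join(row).split(",")
print(len(naive_fields))
```

The extra field in the naive split is the same kind of mismatch that a CSV parser reports when a comma inside the text was not quoted.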

Attempt to import the corrupt file, or at least a small preview of it. It is appropriately titled `'Yelp_Reviews_corrupt.csv'`. Investigate some of the intricacies of skipping rows to then pass over this error and comment on what you think is going on.

# Hint: 
+ Here is a useful programming pattern to use

In [130]:
try:
    data = pd.read_csv("../data/Yelp_Reviews_Corrupt.csv")
except Exception as e:
    print(e)

Error tokenizing data. C error: Expected 10 fields in line 2331, saw 11
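
Besides locating the bad rows by hand, `read_csv` can be told to drop them. The sketch below assumes pandas 1.3 or later, where the `on_bad_lines` argument is available, and uses a made-up three-record file rather than the Yelp data.

```python
import io

import pandas as pd

# Hypothetical file: the second record has an unquoted comma in the text field,
# so it has four fields instead of three.
bad_csv = (
    "id,stars,text\n"
    "1,5,Great food\n"
    "2,1,Terrible, dry corn bread\n"
    "3,4,Nice atmosphere\n"
)

# on_bad_lines="skip" drops rows with too many fields instead of raising an error.
clean = pd.read_csv(io.StringIO(bad_csv), on_bad_lines="skip")
print(clean)
```

Skipping hides the problem rather than fixing it, so it is worth counting how many rows were lost before relying on the result.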



# First Iteration

In [131]:
for i in range(1500,2000):
    try:
        data = pd.read_csv("../data/Yelp_Reviews_Corrupt.csv" , nrows=i)
    except:
        print("First failure at: {}".format(i))
        break
data1 = pd.read_csv("../data/Yelp_Reviews_Corrupt.csv", nrows=i-1)
print(len(data1))
data1.head()

First failure at: 1962
1961


Unnamed: 0.1,Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,1,pomGBqfbxcqPv14c3XH-ZQ,0,2012-11-13,0.0,dDl8zu1vWPdKGihJrwQbpw,5.0,I love this place! My fiance And I go here atl...,0.0,msQe1u7Z_XuqjGoqhB0J5g
1,2,jtQARsP6P-LbkyjbO1qNGg,1,2014-10-23,1.0,LZp4UX5zK3e-c5ZGSeo3kA,1.0,Terrible. Dry corn bread. Rib tips were all fa...,3.0,msQe1u7Z_XuqjGoqhB0J5g
2,4,Ums3gaP2qM3W1XcA5r6SsQ,0,2014-09-05,0.0,jsDu6QEJHbwP2Blom1PLCA,5.0,Delicious healthy food. The steak is amazing. ...,0.0,msQe1u7Z_XuqjGoqhB0J5g
3,5,vgfcTvK81oD4r50NMjU2Ag,0,2011-02-25,0.0,pfavA0hr3nyqO61oupj-lA,1.0,This place sucks. The customer service is horr...,2.0,msQe1u7Z_XuqjGoqhB0J5g
4,10,yFumR3CWzpfvTH2FCthvVw,0,2016-06-15,0.0,STiFMww2z31siPY7BWNC2g,5.0,I have been an Emerald Club member for a numbe...,0.0,TlvV-xJhmh7LCwJYXkV-cg


# Now that we know that the failure was at row 1962, let's skip it

In [135]:
## Second iteration
for i in range(0,500):
    try:
        temp_df = pd.read_csv('../data/Yelp_Reviews_Corrupt.csv', skiprows=1962, nrows=i, names=data1.columns)
    except:
        print("First failure at: {}".format(i))
        break
        
data2 = pd.read_csv('../data/Yelp_Reviews_Corrupt.csv', skiprows=1962, nrows=i, names=data1.columns)
print(len(data2))
data2.head()        

499


Unnamed: 0.1,Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,STAY AWAY FROM THIS PLACE!!!!!!,5,sDofYImMQQmu4Le5G9zmpQ,,,,,,,
1,3948,GAKFx4jFUtTOTpp_jDJnuA,0,2017-09-01,0.0,OUZWMw7EgO7D596pUelSlA,5.0,Nice relaxing atmosphere. Friendly service and...,1.0,6vJY67yve43Ijvn8RKVUow
2,3949,0QzCeORfF8EY34UODWRV9A,0,2017-09-03,0.0,7lbykaWFD8YBwT0mU1Rexw,4.0,Very pleased with our experience. Great off th...,0.0,6vJY67yve43Ijvn8RKVUow
3,3950,tlt8zNrZ6_A3DmXiM-cnBA,0,2016-06-12,0.0,Nd_soHwCYi8adcNIT2w9LQ,1.0,Wife went to this location and was horrible. N...,0.0,S0dnPb1OzaqdBSOxyLr7BQ
4,3952,XD0LjNuPPwJPsTAHecUh7A,0,2015-08-23,0.0,FUUTAr5CECrkfRa9Y2-MSg,1.0,Not baby friendly anymore.,,


In [139]:
temp_data = pd.read_csv("../data/Yelp_Reviews_Corrupt.csv" ,names=data1.columns ,skiprows=1)
print(len(temp_data))
temp_data.head()                        

4651


Unnamed: 0.1,Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,1,pomGBqfbxcqPv14c3XH-ZQ,0,2012-11-13,0,dDl8zu1vWPdKGihJrwQbpw,5,I love this place! My fiance And I go here atl...,0,msQe1u7Z_XuqjGoqhB0J5g
1,2,jtQARsP6P-LbkyjbO1qNGg,1,2014-10-23,1,LZp4UX5zK3e-c5ZGSeo3kA,1,Terrible. Dry corn bread. Rib tips were all fa...,3,msQe1u7Z_XuqjGoqhB0J5g
2,4,Ums3gaP2qM3W1XcA5r6SsQ,0,2014-09-05,0,jsDu6QEJHbwP2Blom1PLCA,5,Delicious healthy food. The steak is amazing. ...,0,msQe1u7Z_XuqjGoqhB0J5g
3,5,vgfcTvK81oD4r50NMjU2Ag,0,2011-02-25,0,pfavA0hr3nyqO61oupj-lA,1,This place sucks. The customer service is horr...,2,msQe1u7Z_XuqjGoqhB0J5g
4,10,yFumR3CWzpfvTH2FCthvVw,0,2016-06-15,0,STiFMww2z31siPY7BWNC2g,5,I have been an Emerald Club member for a numbe...,0,TlvV-xJhmh7LCwJYXkV-cg


In [142]:
pd.read_csv('../data/Yelp_Reviews_Corrupt.csv', skiprows=len(data1)+len(data2), names=data1.columns)


Unnamed: 0.1,Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,Cons:,,,,,,,,,
1,- Dusty! Not sure if it's all of Vegas but I...,,,,,,,,,
2,- Valet parking: kinda inconvenient when you ...,,,,,,,,,
3,- Sofabed is extremely flimsy,if you have more than 2 people,insist on 2 queen beds. the sofa cushions ar...,,,,,,,
4,Other points:,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
2577,First off,it was really awkward sitting on the benches ...,as people walked past us while to wait for ou...,,,,,,,
2578,Second,when we were seated,it was so loud. It felt like we were in a hig...,,,,,,,
2579,Finally - Food was mediocre. I was extremely d...,but it wasn't flavourful.,,,,,,,,
2580,Wasn't worth the hype,unfortunately.,1,PkRFSQgSfca9Tamq7b2LdQ,,,,,,


# Summary:
+ In this session we had a lot of fun with various ways of importing files in pandas
+ We also looked at how to deal with corrupted files in pandas