---
# Creating and Persisting DataFrames
---

In [1]:
import numpy as np
import pandas as pd

Create parallel lists with data in them. Each of these lists will be a column in the
DataFrame, so each list should hold values of a single type and all lists must have the same length:

In [2]:
fname = ["Paul", "John", "Richard", "George"]
lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
birth = [1942, 1940, 1940, 1943]

Create a dictionary from the lists, mapping the column name to the list:

In [3]:
people = dict(first=fname, last=lname, birth=birth)

Create a DataFrame from the dictionary:

In [4]:
beatles = pd.DataFrame(people)

In [5]:
beatles

Unnamed: 0,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943


In [None]:
beatles.index

RangeIndex(start=0, stop=4, step=1)

In [None]:
# change index
pd.DataFrame(data=people, index=list('abcd'))

Unnamed: 0,first,last,birth
a,Paul,McCartney,1942
b,John,Lennon,1940
c,Richard,Starkey,1940
d,George,Harrison,1943


## Writing CSV

Write the DataFrame to a CSV file:

In [None]:
beatles

Unnamed: 0,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943


In [6]:
from io import StringIO

In [7]:
beatles_file = StringIO()
beatles.to_csv(beatles_file)

Look at the file contents:

In [8]:
print(beatles_file.getvalue())

,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943



In [9]:
_ = beatles_file.seek(0)
pd.read_csv(beatles_file)

Unnamed: 0.1,Unnamed: 0,first,last,birth
0,0,Paul,McCartney,1942
1,1,John,Lennon,1940
2,2,Richard,Starkey,1940
3,3,George,Harrison,1943


The `read_csv` function has an `index_col` parameter that you can use to specify the
location of the index:

In [10]:
_ = beatles_file.seek(0)
pd.read_csv(beatles_file, index_col=0)

Unnamed: 0,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943


Alternatively, if you do not want to include the index when writing the CSV file, set the
`index` parameter to `False`:

In [11]:
beatles_file = StringIO()
beatles.to_csv(beatles_file, index=False)

In [12]:
_ = beatles_file.seek(0)
pd.read_csv(beatles_file)

Unnamed: 0,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943


## Reading large CSV files

The pandas library is an in-memory tool: your data must fit in memory before pandas can work with it. If you come across a large CSV file that you want to process, you have a few options. If you can process portions of it at a time, you can read it in chunks and process each chunk. Alternatively, if you expect to have enough memory to load the file, a few techniques help pare down its in-memory size.  
Note that, as a rule of thumb, you should have three to ten times as much memory as the size of the DataFrame that you want to manipulate. The extra memory gives you room to perform many of the common operations.
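One way to apply this rule of thumb before loading a whole file is to measure a sample and extrapolate. The following is a minimal sketch, assuming a synthetic CSV held in memory as a stand-in for a large file; the column names and sizes are made up for illustration, and the estimate is only as good as the sample is representative.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Hypothetical stand-in for a large CSV file: 10,000 rows of synthetic data.
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "carat": rng.uniform(0.2, 1.3, size=10_000).round(2),
    "cut": rng.choice(["Ideal", "Premium", "Good"], size=10_000),
    "price": rng.integers(300, 3_000, size=10_000),
})
csv_text = full.to_csv(index=False)

# Load only a sample and measure its footprint (deep=True counts string contents).
sample = pd.read_csv(StringIO(csv_text), nrows=1_000)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

# Count data rows without parsing the file (subtract the header line).
total_rows = csv_text.count("\n") - 1

estimated_mb = bytes_per_row * total_rows / 1e6
print(f"~{bytes_per_row:.0f} bytes/row, ~{estimated_mb:.2f} MB for {total_rows} rows")
```

Multiplying the estimate by three to ten gives a rough idea of the working memory the rule of thumb calls for.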

In [None]:
diamonds = pd.read_csv('./diamonds.csv', nrows=1000)
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


Use the `.info` method to see how much memory the sample of data uses:

In [None]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    1000 non-null   float64
 1   cut      1000 non-null   object 
 2   color    1000 non-null   object 
 3   clarity  1000 non-null   object 
 4   depth    1000 non-null   float64
 5   table    1000 non-null   float64
 6   price    1000 non-null   int64  
 7   x        1000 non-null   float64
 8   y        1000 non-null   float64
 9   z        1000 non-null   float64
dtypes: float64(6), int64(1), object(3)
memory usage: 78.2+ KB


Use the `dtype` parameter of `read_csv` to tell it to use the correct (or smaller) numeric types:

In [None]:
diamonds2 = pd.read_csv('./diamonds.csv', nrows=1000, dtype={
    'carat': np.float32,
    'depth': np.float32,
    'table': np.float32,
    "x": np.float32,
    "y": np.float32,
    "z": np.float32,
    "price": np.int16,
})

diamonds2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    1000 non-null   float32
 1   cut      1000 non-null   object 
 2   color    1000 non-null   object 
 3   clarity  1000 non-null   object 
 4   depth    1000 non-null   float32
 5   table    1000 non-null   float32
 6   price    1000 non-null   int16  
 7   x        1000 non-null   float32
 8   y        1000 non-null   float32
 9   z        1000 non-null   float32
dtypes: float32(6), int16(1), object(3)
memory usage: 49.0+ KB


Make sure that the summary statistics of the new dataset are similar to those of the original:


In [None]:
diamonds.describe().equals(diamonds2.describe())

False

In [None]:
diamonds.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
carat,1000.0,0.68928,0.195291,0.2,0.7,0.71,0.79,1.27
depth,1000.0,61.7228,1.758879,53.0,60.9,61.8,62.6,69.5
table,1000.0,57.7347,2.467946,52.0,56.0,57.0,59.0,70.0
price,1000.0,2476.54,839.57562,326.0,2777.0,2818.0,2856.0,2898.0
x,1000.0,5.60594,0.625173,3.79,5.64,5.77,5.92,7.12
y,1000.0,5.59918,0.611974,3.75,5.63,5.76,5.91,7.05
z,1000.0,3.45753,0.389819,2.27,3.45,3.55,3.64,4.33


In [None]:
diamonds2.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
carat,1000.0,0.689281,0.195291,0.2,0.7,0.71,0.79,1.27
depth,1000.0,61.722824,1.758878,53.0,60.900002,61.799999,62.599998,69.5
table,1000.0,57.734699,2.467944,52.0,56.0,57.0,59.0,70.0
price,1000.0,2476.54,839.57562,326.0,2777.0,2818.0,2856.0,2898.0
x,1000.0,5.605941,0.625173,3.79,5.64,5.77,5.92,7.12
y,1000.0,5.59918,0.611972,3.75,5.63,5.76,5.91,7.05
z,1000.0,3.457533,0.389819,2.27,3.45,3.55,3.64,4.33
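The two summaries above differ only in the last few digits, which is why `.equals` returns `False`. A tolerance-based comparison is one way to confirm that the statistics are effectively the same. This is a minimal sketch on synthetic data (the values and the `rtol` threshold are assumptions, not taken from the diamonds file):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the diamonds sample (hypothetical values).
rng = np.random.default_rng(42)
df64 = pd.DataFrame({
    "carat": rng.uniform(0.2, 1.3, size=1_000),
    "depth": rng.normal(61.7, 1.8, size=1_000),
})
df32 = df64.astype(np.float32)

stats64 = df64.describe()
stats32 = df32.describe()

# Exact equality usually fails because float32 rounds the stored values.
print(stats64.equals(stats32))

# A relative tolerance well above float32 precision treats the summaries as equivalent.
close = np.allclose(
    stats64.to_numpy(), stats32.to_numpy(dtype="float64"), rtol=1e-4
)
print(close)
```

If the tolerance check fails, the narrower type is probably losing information that matters for the analysis.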


Use the `dtype` parameter to change object columns to categoricals. First, inspect the output of the `.value_counts` method for each object column. If a column has low cardinality, you can convert it to a categorical column to save even more memory:

In [None]:
diamonds2.select_dtypes(include='object').columns

Index(['cut', 'color', 'clarity'], dtype='object')

In [None]:
diamonds2.cut.value_counts()

Ideal        333
Premium      290
Very Good    226
Good          89
Fair          62
Name: cut, dtype: int64

In [None]:
diamonds2.color.value_counts()

E    240
F    226
G    139
D    129
H    125
I     95
J     46
Name: color, dtype: int64

In [None]:
diamonds2.clarity.value_counts()

SI1     306
VS2     218
VS1     159
SI2     154
VVS2     62
VVS1     58
I1       29
IF       14
Name: clarity, dtype: int64

Because these columns have low cardinality, we can convert them to categoricals and use
around 37% of the original memory:

In [None]:
diamonds3 = pd.read_csv('./diamonds.csv', nrows=1000,
                        dtype={
                                'carat': np.float32,
                                'depth': np.float32,
                                'table': np.float32,
                                "x": np.float32,
                                "y": np.float32,
                                "z": np.float32,
                                "price": np.int16,
                               "cut": "category",
                               "color": "category",
                               "clarity": "category",
                              },
                        )
diamonds3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    1000 non-null   float32 
 1   cut      1000 non-null   category
 2   color    1000 non-null   category
 3   clarity  1000 non-null   category
 4   depth    1000 non-null   float32 
 5   table    1000 non-null   float32 
 6   price    1000 non-null   int16   
 7   x        1000 non-null   float32 
 8   y        1000 non-null   float32 
 9   z        1000 non-null   float32 
dtypes: category(3), float32(6), int16(1)
memory usage: 29.4 KB
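The same conversion can be applied after loading. The sketch below uses a hypothetical helper, `to_categories`, with an assumed `max_ratio` threshold for what counts as low cardinality; neither is part of pandas, and the data is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "cut": rng.choice(["Ideal", "Premium", "Very Good", "Good", "Fair"], size=1_000),
    "price": rng.integers(300, 3_000, size=1_000),
})

def to_categories(frame, max_ratio=0.05):
    """Convert string columns whose unique/total ratio is at most max_ratio."""
    out = frame.copy()
    for col in out.columns:
        is_text = pd.api.types.is_string_dtype(out[col])
        if is_text and out[col].nunique() / len(out) <= max_ratio:
            out[col] = out[col].astype("category")
    return out

converted = to_categories(df)

# deep=True includes the memory held by the strings themselves.
before = df.memory_usage(deep=True).sum()
after = converted.memory_usage(deep=True).sum()
print(before, after)
```

A categorical column stores each distinct string once plus a small integer code per row, which is where the saving comes from.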


If there are columns that we know we can ignore, we can use the `usecols`
parameter to specify the columns we want to load. Here, we will ignore columns `x`, `y`,
and `z`:

In [None]:
cols = ['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price']
diamonds4 = pd.read_csv('./diamonds.csv', nrows=1000,
                        dtype={
                                'carat': np.float32,
                                'depth': np.float32,
                                'table': np.float32,
                                "x": np.float32,
                                "y": np.float32,
                                "z": np.float32,
                                "price": np.int16,
                               "cut": "category",
                               "color": "category",
                               "clarity": "category",
                              },
                        usecols=cols,
                        )
diamonds4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    1000 non-null   float32 
 1   cut      1000 non-null   category
 2   color    1000 non-null   category
 3   clarity  1000 non-null   category
 4   depth    1000 non-null   float32 
 5   table    1000 non-null   float32 
 6   price    1000 non-null   int16   
dtypes: category(3), float32(3), int16(1)
memory usage: 17.7 KB


If the preceding steps are not sufficient to create a small enough DataFrame, you might still be in luck. If you can process chunks of the data at a time and do not need all of it in memory, you can use the `chunksize` parameter of `read_csv`, which returns an iterator of DataFrames instead of a single DataFrame:

In [None]:
diamonds_iter = pd.read_csv('./diamonds.csv', nrows=1000,
                        dtype={
                                'carat': np.float32,
                                'depth': np.float32,
                                'table': np.float32,
                                "x": np.float32,
                                "y": np.float32,
                                "z": np.float32,
                                "price": np.int16,
                               "cut": "category",
                               "color": "category",
                               "clarity": "category",
                              },
                        usecols=cols,
                        chunksize=200,
                        )

In [None]:
def process(df):
    # Placeholder for real per-chunk work: report how many values the chunk holds
    return f"processed {df.size} items"

In [None]:
for chunk in diamonds_iter:
    print(process(chunk))
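Chunked processing is most useful when the per-chunk results can be combined into a final answer. As a minimal sketch, assuming a small synthetic CSV in place of a large file, a mean can be computed by accumulating a running sum and row count:

```python
from io import StringIO

import pandas as pd

# Small synthetic CSV standing in for a file too large to load at once.
csv_text = "price\n" + "\n".join(str(p) for p in range(1, 1001)) + "\n"

total = 0
count = 0
# Each chunk is an ordinary DataFrame with at most 200 rows.
for chunk in pd.read_csv(StringIO(csv_text), chunksize=200):
    total += chunk["price"].sum()
    count += len(chunk)

mean_price = total / count
print(count, mean_price)
```

Only one chunk is held in memory at a time, so peak usage depends on `chunksize` rather than on the size of the file.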

In [None]:
np.iinfo(np.int8)

iinfo(min=-128, max=127, dtype=int8)

In [None]:
np.finfo(np.float16)

finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)
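The limits reported by `np.iinfo` can guide the choice of integer type. The sketch below uses a hypothetical helper, `smallest_int_dtype`, that picks the narrowest signed type covering a column's observed range. It assumes the values seen are representative: a sample of the first rows may not contain the largest value in the full file.

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(series):
    """Return the narrowest signed integer dtype whose range covers the series."""
    lo, hi = series.min(), series.max()
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return np.dtype(candidate)
    raise ValueError("values do not fit in a signed 64-bit integer")

prices = pd.Series([326, 2777, 2898])
chosen = smallest_int_dtype(prices)
print(chosen)  # int16

# pandas has a built-in alternative that makes the same choice for this data.
downcast = pd.to_numeric(prices, downcast="integer")
print(downcast.dtype)  # int16
```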

In [None]:
diamonds.price.memory_usage()

8128

In [None]:
diamonds.price.memory_usage(index=False)

8000

In [None]:
diamonds.cut.memory_usage()

8128

In [None]:
diamonds.cut.memory_usage(deep=True)

63461

In [None]:
diamonds4.to_feather('/tmp/d.arr')

In [None]:
diamonds5 = pd.read_feather('/tmp/d.arr')

In [None]:
diamonds4.to_parquet('/tmp/d.pqt')

In [None]:
diamonds4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    1000 non-null   float32 
 1   cut      1000 non-null   category
 2   color    1000 non-null   category
 3   clarity  1000 non-null   category
 4   depth    1000 non-null   float32 
 5   table    1000 non-null   float32 
 6   price    1000 non-null   int16   
dtypes: category(3), float32(3), int16(1)
memory usage: 17.7 KB


## Using Excel files

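You may need to install `xlwt` or `openpyxl` to write XLS or XLSX files, respectively. Note that pandas 2.0 and later no longer write the legacy XLS format, so the `.xls` examples below apply to older versions.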

Create an Excel file using the `.to_excel` method

In [13]:
beatles.to_excel('beat.xls')

In [14]:
beatles.to_excel('beat.xlsx')

Read the Excel file with the `read_excel` function

In [16]:
beat2 = pd.read_excel('beat.xls', index_col=0)
beat2

Unnamed: 0,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943


Inspect data types of the file to check that Excel preserved the types:

In [17]:
beat2.dtypes

first    object
last     object
birth     int64
dtype: object

In [29]:
beat2['first'] = beat2['first'].astype('category')

In [30]:
beat2.dtypes

first    category
last       object
birth       int64
dtype: object

We can use pandas to write to a sheet of a spreadsheet. You can pass a `sheet_name`
parameter to the `.to_excel` method to tell it the name of the sheet to create:

In [33]:
xl_writer = pd.ExcelWriter('beat2.xlsx')
beatles.to_excel(xl_writer, sheet_name='All')
beatles[beatles.birth <= 1942].to_excel(xl_writer, sheet_name='1940')
xl_writer.close()  # write the workbook to disk and release the file

## Working with ZIP files


If the CSV file is the only file in the ZIP file, you can just call the `read_csv` function on
it:

In [35]:
autos = pd.read_csv('./vehicles.csv.zip')
autos.head()


Unnamed: 0,barrels08,barrelsA08,charge120,charge240,city08,city08U,cityA08,cityA08U,cityCD,cityE,cityUF,co2,co2A,co2TailpipeAGpm,co2TailpipeGpm,comb08,comb08U,combA08,combA08U,combE,combinedCD,combinedUF,cylinders,displ,drive,engId,eng_dscr,feScore,fuelCost08,fuelCostA08,fuelType,fuelType1,ghgScore,ghgScoreA,highway08,highway08U,highwayA08,highwayA08U,highwayCD,highwayE,...,id,lv2,lv4,make,model,mpgData,phevBlended,pv2,pv4,range,rangeCity,rangeCityA,rangeHwy,rangeHwyA,trany,UCity,UCityA,UHighway,UHighwayA,VClass,year,youSaveSpend,guzzler,trans_dscr,tCharger,sCharger,atvType,fuelType2,rangeA,evMotor,mfrCode,c240Dscr,charge240b,c240bDscr,createdOn,modifiedOn,startStop,phevCity,phevHwy,phevComb
0,15.695714,0.0,0.0,0.0,19,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,423.190476,21,0.0,0,0.0,0.0,0.0,0.0,4.0,2.0,Rear-Wheel Drive,9011,(FFS),-1,1900,0,Regular,Regular Gasoline,-1,-1,25,0.0,0,0.0,0.0,0.0,...,1,0,0,Alfa Romeo,Spider Veloce 2000,Y,False,0,0,0,0.0,0.0,0.0,0.0,Manual 5-spd,23.3333,0.0,35.0,0.0,Two Seaters,1985,-1750,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
1,29.964545,0.0,0.0,0.0,9,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,807.909091,11,0.0,0,0.0,0.0,0.0,0.0,12.0,4.9,Rear-Wheel Drive,22020,(GUZZLER),-1,3650,0,Regular,Regular Gasoline,-1,-1,14,0.0,0,0.0,0.0,0.0,...,10,0,0,Ferrari,Testarossa,N,False,0,0,0,0.0,0.0,0.0,0.0,Manual 5-spd,11.0,0.0,19.0,0.0,Two Seaters,1985,-10500,T,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
2,12.207778,0.0,0.0,0.0,23,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,329.148148,27,0.0,0,0.0,0.0,0.0,0.0,4.0,2.2,Front-Wheel Drive,2100,(FFS),-1,1500,0,Regular,Regular Gasoline,-1,-1,33,0.0,0,0.0,0.0,0.0,...,100,0,0,Dodge,Charger,Y,False,0,0,0,0.0,0.0,0.0,0.0,Manual 5-spd,29.0,0.0,47.0,0.0,Subcompact Cars,1985,250,,SIL,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
3,29.964545,0.0,0.0,0.0,10,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,807.909091,11,0.0,0,0.0,0.0,0.0,0.0,8.0,5.2,Rear-Wheel Drive,2850,,-1,3650,0,Regular,Regular Gasoline,-1,-1,12,0.0,0,0.0,0.0,0.0,...,1000,0,0,Dodge,B150/B250 Wagon 2WD,N,False,0,0,0,0.0,0.0,0.0,0.0,Automatic 3-spd,12.2222,0.0,16.6667,0.0,Vans,1985,-10500,,,,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
4,17.347895,0.0,0.0,0.0,17,0.0,0,0.0,0.0,0.0,0.0,-1,-1,0.0,467.736842,19,0.0,0,0.0,0.0,0.0,0.0,4.0,2.2,4-Wheel or All-Wheel Drive,66031,"(FFS,TRBO)",-1,2500,0,Premium,Premium Gasoline,-1,-1,23,0.0,0,0.0,0.0,0.0,...,10000,0,14,Subaru,Legacy AWD Turbo,N,False,0,90,0,0.0,0.0,0.0,0.0,Manual 5-spd,21.0,0.0,32.0,0.0,Compact Cars,1993,-4750,,,T,,,,,,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0


In [36]:
autos.columns

Index(['barrels08', 'barrelsA08', 'charge120', 'charge240', 'city08',
       'city08U', 'cityA08', 'cityA08U', 'cityCD', 'cityE', 'cityUF', 'co2',
       'co2A', 'co2TailpipeAGpm', 'co2TailpipeGpm', 'comb08', 'comb08U',
       'combA08', 'combA08U', 'combE', 'combinedCD', 'combinedUF', 'cylinders',
       'displ', 'drive', 'engId', 'eng_dscr', 'feScore', 'fuelCost08',
       'fuelCostA08', 'fuelType', 'fuelType1', 'ghgScore', 'ghgScoreA',
       'highway08', 'highway08U', 'highwayA08', 'highwayA08U', 'highwayCD',
       'highwayE', 'highwayUF', 'hlv', 'hpv', 'id', 'lv2', 'lv4', 'make',
       'model', 'mpgData', 'phevBlended', 'pv2', 'pv4', 'range', 'rangeCity',
       'rangeCityA', 'rangeHwy', 'rangeHwyA', 'trany', 'UCity', 'UCityA',
       'UHighway', 'UHighwayA', 'VClass', 'year', 'youSaveSpend', 'guzzler',
       'trans_dscr', 'tCharger', 'sCharger', 'atvType', 'fuelType2', 'rangeA',
       'evMotor', 'mfrCode', 'c240Dscr', 'charge240b', 'c240bDscr',
       'createdOn', 'modifiedOn', 'startStop', 'phevCity', 'phevHwy',
       'phevComb'],
      dtype='object')

In [37]:
autos.modifiedOn.dtype

dtype('O')

One thing to be aware of is that date columns in a CSV file are left as strings by default. You have two options to convert them. You can use the `parse_dates` parameter of `read_csv` to convert them when loading the file. Alternatively, you
can use the more powerful `to_datetime` function after loading:

In [38]:
autos.modifiedOn.head()

0    Tue Jan 01 00:00:00 EST 2013
1    Tue Jan 01 00:00:00 EST 2013
2    Tue Jan 01 00:00:00 EST 2013
3    Tue Jan 01 00:00:00 EST 2013
4    Tue Jan 01 00:00:00 EST 2013
Name: modifiedOn, dtype: object

In [39]:
pd.to_datetime(autos.modifiedOn)



0       2013-01-01
1       2013-01-01
2       2013-01-01
3       2013-01-01
4       2013-01-01
           ...    
39096   2013-01-01
39097   2013-01-01
39098   2013-01-01
39099   2013-01-01
39100   2013-01-01
Name: modifiedOn, Length: 39101, dtype: datetime64[ns]

Conversion at load time:

In [40]:
autos = pd.read_csv('./vehicles.csv.zip', parse_dates=['modifiedOn'])
autos.modifiedOn.head()


0   2013-01-01
1   2013-01-01
2   2013-01-01
3   2013-01-01
4   2013-01-01
Name: modifiedOn, dtype: datetime64[ns]
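When the layout of the date strings is known, passing an explicit `format` to `to_datetime` avoids guessing. This is a minimal sketch on two made-up values shaped like the `modifiedOn` column. It assumes every value carries the same `EST` abbreviation and simply drops it, so the result is timezone-naive.

```python
import pandas as pd

raw = pd.Series([
    "Tue Jan 01 00:00:00 EST 2013",
    "Wed Jan 02 12:30:00 EST 2013",
])

# Assumption: the timezone abbreviation is always " EST", so it is removed as plain text.
cleaned = raw.str.replace(" EST", "", regex=False)

# The format string mirrors the remaining layout: weekday, month, day, time, year.
parsed = pd.to_datetime(cleaned, format="%a %b %d %H:%M:%S %Y")
print(parsed)
```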

If the ZIP file has many files in it, reading a CSV file from it is a little more involved.
The `read_csv` function does not have the ability to specify a file inside a ZIP file.
Instead, we will use the `zipfile` module from the Python standard library.

In [41]:
import zipfile

In [44]:
with zipfile.ZipFile('kaggle-survey-2018.zip') as z:
  print('\n'.join(z.namelist()))
  kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
  kag_questions = kag.iloc[0]
  survey = kag.iloc[1:]

multipleChoiceResponses.csv
freeFormResponses.csv
SurveySchema.csv
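The same pattern can be tried without the Kaggle archive. The following sketch builds a small ZIP file in memory (the member names and contents are hypothetical) and reads one CSV member from it:

```python
import io
import zipfile

import pandas as pd

# Build a small in-memory archive with two CSV members (hypothetical names and data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as z:
    z.writestr("responses.csv", "id,answer\n1,yes\n2,no\n")
    z.writestr("schema.csv", "column,meaning\nid,respondent\n")

buffer.seek(0)
with zipfile.ZipFile(buffer) as z:
    members = z.namelist()
    # z.open returns a binary file-like object that read_csv accepts.
    with z.open("responses.csv") as handle:
        responses = pd.read_csv(handle)

print(members)
print(responses)
```

Opening the member inside a `with` block closes the handle as soon as the DataFrame has been read.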



In [45]:
kag_questions.head()

Time from Start to Finish (seconds)                                Duration (in seconds)
Q1                                                What is your gender? - Selected Choice
Q1_OTHER_TEXT                          What is your gender? - Prefer to self-describe...
Q2                                                           What is your age (# years)?
Q3                                             In which country do you currently reside?
Name: 0, dtype: object

In [46]:
survey.head()

Unnamed: 0,Time from Start to Finish (seconds),Q1,Q1_OTHER_TEXT,Q2,Q3,Q4,Q5,Q6,Q6_OTHER_TEXT,Q7,Q7_OTHER_TEXT,Q8,Q9,Q10,Q11_Part_1,Q11_Part_2,Q11_Part_3,Q11_Part_4,Q11_Part_5,Q11_Part_6,Q11_Part_7,Q11_OTHER_TEXT,Q12_MULTIPLE_CHOICE,Q12_Part_1_TEXT,Q12_Part_2_TEXT,Q12_Part_3_TEXT,Q12_Part_4_TEXT,Q12_Part_5_TEXT,Q12_OTHER_TEXT,Q13_Part_1,Q13_Part_2,Q13_Part_3,Q13_Part_4,Q13_Part_5,Q13_Part_6,Q13_Part_7,Q13_Part_8,Q13_Part_9,Q13_Part_10,Q13_Part_11,...,Q46,Q47_Part_1,Q47_Part_2,Q47_Part_3,Q47_Part_4,Q47_Part_5,Q47_Part_6,Q47_Part_7,Q47_Part_8,Q47_Part_9,Q47_Part_10,Q47_Part_11,Q47_Part_12,Q47_Part_13,Q47_Part_14,Q47_Part_15,Q47_Part_16,Q48,Q49_Part_1,Q49_Part_2,Q49_Part_3,Q49_Part_4,Q49_Part_5,Q49_Part_6,Q49_Part_7,Q49_Part_8,Q49_Part_9,Q49_Part_10,Q49_Part_11,Q49_Part_12,Q49_OTHER_TEXT,Q50_Part_1,Q50_Part_2,Q50_Part_3,Q50_Part_4,Q50_Part_5,Q50_Part_6,Q50_Part_7,Q50_Part_8,Q50_OTHER_TEXT
1,710,Female,-1,45-49,United States of America,Doctoral degree,Other,Consultant,-1,Other,0,,,I do not know,Analyze and understand data to influence produ...,Build and/or run a machine learning service th...,Build and/or run the data infrastructure that ...,,Do research that advances the state of the art...,,,-1,"Cloud-based data software & APIs (AWS, GCP, Az...",-1,-1,-1,-1,0,-1,Jupyter/IPython,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
2,434,Male,-1,30-34,Indonesia,Bachelor’s degree,Engineering (non-computer focused),Other,0,Manufacturing/Fabrication,-1,5-10,"10-20,000",No (we do not use ML methods),,,,,,None of these activities are an important part...,,-1,"Basic statistical software (Microsoft Excel, G...",1,-1,-1,-1,-1,-1,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1,,,,,,,,,-1
3,718,Female,-1,30-34,United States of America,Master’s degree,"Computer science (software engineering, etc.)",Data Scientist,-1,I am a student,-1,0-1,"0-10,000",I do not know,Analyze and understand data to influence produ...,,,,,,,-1,Local or hosted development environments (RStu...,-1,-1,-1,0,-1,-1,,,,,,,MATLAB,,,,,...,10-20,,Examine feature correlations,Examine feature importances,,,,,Plot predicted vs. actual results,,,,,,,,,I am confident that I can explain the outputs ...,,,,,,,Make sure the code is human-readable,Define all random seeds,,Include a text file describing all dependencies,,,-1,,Too time-consuming,,,,,,,-1
4,621,Male,-1,35-39,United States of America,Master’s degree,"Social sciences (anthropology, psychology, soc...",Not employed,-1,,-1,,,,,,,,,,,-1,Local or hosted development environments (RStu...,-1,-1,-1,1,-1,-1,Jupyter/IPython,RStudio,PyCharm,,,,,Visual Studio,,,Vim,...,20-30,,Examine feature correlations,Examine feature importances,Plot decision boundaries,,,,Plot predicted vs. actual results,,Sensitivity analysis/perturbation importance,,,,,,,"Yes, most ML models are ""black boxes""",,,"Share data, code, and environment using a host...",,,,Make sure the code is human-readable,,Define relative rather than absolute file paths,,,,-1,,,Requires too much technical knowledge,,Not enough incentives to share my work,,,,-1
5,731,Male,-1,22-24,India,Master’s degree,Mathematics or statistics,Data Analyst,-1,I am a student,-1,0-1,"0-10,000",I do not know,,,,,,,Other,-1,"Advanced statistical software (SPSS, SAS, etc.)",-1,1,-1,-1,-1,-1,,RStudio,,,,,,,,,,...,20-30,,,,,Create partial dependence plots,,,,,,,,,,,,I am confident that I can understand and expla...,,,,,,,,,Define relative rather than absolute file paths,,,,-1,,Too time-consuming,,,Not enough incentives to share my work,,,,-1


In [47]:
survey.head(2).T

Unnamed: 0,1,2
Time from Start to Finish (seconds),710,434
Q1,Female,Male
Q1_OTHER_TEXT,-1,-1
Q2,45-49,30-34
Q3,United States of America,Indonesia
...,...,...
Q50_Part_5,,
Q50_Part_6,,
Q50_Part_7,,
Q50_Part_8,,


The `zipfile` module will not work with URLs (unlike the `read_csv` function). So, if
your ZIP file is at a URL, you will need to download it first.  
The `read_csv` function works with other compression types as well. If you have GZIP, BZ2,
or XZ files, pandas can handle those as long as they compress a single CSV file and not
a directory.
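As a minimal sketch of reading a compressed CSV, the example below gzips a few bytes in memory and passes them to `read_csv`. The data is made up, and the in-memory buffer is only a stand-in for a `.csv.gz` file on disk.

```python
import gzip
import io

import pandas as pd

csv_bytes = b"first,last\nPaul,McCartney\nJohn,Lennon\n"
compressed = gzip.compress(csv_bytes)

# With a file path, the default compression="infer" picks the codec from the extension.
# An in-memory buffer has no name, so the codec is given explicitly here.
df_gz = pd.read_csv(io.BytesIO(compressed), compression="gzip")
print(df_gz)
```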

## Reading JSON


In [48]:
import json

In [49]:
encoded = json.dumps(people)
encoded

'{"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}'

In [50]:
json.loads(encoded)

{'birth': [1942, 1940, 1940, 1943],
 'first': ['Paul', 'John', 'Richard', 'George'],
 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison']}

Read the data using the `read_json` function

In [51]:
# Wrap the text in StringIO: recent pandas versions expect a path or file-like object
beatles_js = pd.read_json(StringIO(encoded))
beatles_js

Unnamed: 0,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943
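JSON data is often laid out as one object per row rather than one list per column. The sketch below, using a small made-up DataFrame, writes that layout with `orient="records"` and reads it back:

```python
from io import StringIO

import pandas as pd

beatles_demo = pd.DataFrame({
    "first": ["Paul", "John"],
    "last": ["McCartney", "Lennon"],
    "birth": [1942, 1940],
})

# orient="records" writes a JSON array with one object per row.
records_json = beatles_demo.to_json(orient="records")
print(records_json)

# Reading through StringIO treats the text as a file-like object.
round_trip = pd.read_json(StringIO(records_json), orient="records")
print(round_trip)
```

The records layout does not store the index, so a default integer index is rebuilt on load.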


## Reading HTML tables

In [52]:
url = 'https://en.wikipedia.org/wiki/The_Beatles_discography'
dfs = pd.read_html(url)
len(dfs)

58

Inspect the first DataFrame:

In [53]:
dfs[0]

Unnamed: 0,The Beatles discography,The Beatles discography.1
0,The Beatles in 1965,The Beatles in 1965
1,Studio albums,"13 (core catalogue), 21 (worldwide)"
2,Live albums,5
3,Compilation albums,54
4,Video albums,22
5,Music videos,68
6,EPs,36
7,Singles,63
8,Mash-ups,2
9,Box sets,17


In [54]:
dfs = pd.read_html(url, match='List of studio albums', na_values="—")
len(dfs)

1

In [55]:
dfs[0].columns

MultiIndex([(               'Title',          'Title'),
            (       'Album details',  'Album details'),
            ('Peak chart positions',       'UK[6][7]'),
            ('Peak chart positions',         'AUS[8]'),
            ('Peak chart positions',         'CAN[9]'),
            ('Peak chart positions',        'FRA[10]'),
            ('Peak chart positions',        'GER[11]'),
            ('Peak chart positions',        'NOR[12]'),
            ('Peak chart positions',     'US[13][14]'),
            (      'Certifications', 'Certifications')],
           )

In [56]:
dfs = pd.read_html(url, match='List of studio albums', 
                   na_values="—", 
                   header=[0, 1])
len(dfs)

1

In [57]:
dfs[0]

Unnamed: 0_level_0,Title,Album details,Peak chart positions,Peak chart positions,Peak chart positions,Peak chart positions,Peak chart positions,Peak chart positions,Peak chart positions,Certifications
Unnamed: 0_level_1,Title,Album details,UK[6][7],AUS[8],CAN[9],FRA[10],GER[11],NOR[12],US[13][14],Certifications
0,Please Please Me,Released: 22 March 1963 Label: Parlophone (UK),1,,,5,5,,,BPI: Platinum[15] ARIA: Gold[16] MC: Gold[17] ...
1,With the Beatles[B],Released: 22 November 1963 Label: Parlophone (...,1,,,5,1,,,BPI: Gold[15] ARIA: Gold[16] BVMI: Gold[19] MC...
2,Introducing... The Beatles,Released: 10 January 1964 Label: Vee-Jay (US),,,,,,,2,RIAA: Platinum[18]
3,Meet the Beatles!,Released: 20 January 1964 Label: Capitol (US),,,1,,,,1,MC: Platinum[17] RIAA: 5× Platinum[18]
4,Twist and Shout,Released: 3 February 1964 Label: Capitol (CAN),,,1,,,,,MC: 3× Platinum[17]
5,The Beatles' Second Album,Released: 10 April 1964 Label: Capitol (US),,,1,,50,,1,MC: Platinum[17] RIAA: 2× Platinum[18]
6,The Beatles' Long Tall Sally,Released: 11 May 1964 Label: Capitol (CAN),,,1,,,,,MC: Gold[17]
7,A Hard Day's Night,Released: 26 June 1964 Label: United Artists (...,,,1,5,,,1,MC: Platinum[17] RIAA: 4× Platinum[18]
8,A Hard Day's Night,Released: 10 July 1964 Label: Parlophone (UK),1,1,,,1,,,BPI: Platinum[15] ARIA: Gold[16]
9,Something New,Released: 20 July 1964 Label: Capitol (US),,,2,,38,,2,MC: Gold[17] RIAA: 2× Platinum[18]


In [58]:
df = dfs[0]
df.columns = [
              "Title",
              "Release",
              "UK",
              "AUS",
              "CAN",
              "FRA",
              "GER",
              "NOR",
              "US",
              "Certifications",
]

df

Unnamed: 0,Title,Release,UK,AUS,CAN,FRA,GER,NOR,US,Certifications
0,Please Please Me,Released: 22 March 1963 Label: Parlophone (UK),1,,,5,5,,,BPI: Platinum[15] ARIA: Gold[16] MC: Gold[17] ...
1,With the Beatles[B],Released: 22 November 1963 Label: Parlophone (...,1,,,5,1,,,BPI: Gold[15] ARIA: Gold[16] BVMI: Gold[19] MC...
2,Introducing... The Beatles,Released: 10 January 1964 Label: Vee-Jay (US),,,,,,,2,RIAA: Platinum[18]
3,Meet the Beatles!,Released: 20 January 1964 Label: Capitol (US),,,1,,,,1,MC: Platinum[17] RIAA: 5× Platinum[18]
4,Twist and Shout,Released: 3 February 1964 Label: Capitol (CAN),,,1,,,,,MC: 3× Platinum[17]
5,The Beatles' Second Album,Released: 10 April 1964 Label: Capitol (US),,,1,,50,,1,MC: Platinum[17] RIAA: 2× Platinum[18]
6,The Beatles' Long Tall Sally,Released: 11 May 1964 Label: Capitol (CAN),,,1,,,,,MC: Gold[17]
7,A Hard Day's Night,Released: 26 June 1964 Label: United Artists (...,,,1,5,,,1,MC: Platinum[17] RIAA: 4× Platinum[18]
8,A Hard Day's Night,Released: 10 July 1964 Label: Parlophone (UK),1,1,,,1,,,BPI: Platinum[15] ARIA: Gold[16]
9,Something New,Released: 20 July 1964 Label: Capitol (US),,,2,,38,,2,MC: Gold[17] RIAA: 2× Platinum[18]


Split the `Release` column into two columns, `release_date` and `label`:

In [60]:
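res = (
    # Drop any rows where the title cell actually holds release details
    df.pipe(lambda df_: df_[~df_.Title.str.startswith("Released")])
    .assign(
        # Pull the date out of "Released: <date> Label: ..." and strip the "[E]" footnote marker
        release_date=lambda df_: pd.to_datetime(
            df_.Release.str.extract(r"Released: (.*) Label")[0]
            .str.replace(r"\[E\]", "", regex=True)
        ),
        # expand=False returns a Series rather than a one-column DataFrame
        label=lambda df_: df_.Release.str.extract(r"Label: (.*)", expand=False),
    )
    .loc[
        :,
        ["Title", "UK", "AUS", "CAN", "FRA", "GER", "NOR", "US", "release_date", "label"],
    ]
)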


res

Unnamed: 0,Title,UK,AUS,CAN,FRA,GER,NOR,US,release_date,label
0,Please Please Me,1,,,5,5,,,1963-03-22,Parlophone (UK)
1,With the Beatles[B],1,,,5,1,,,1963-11-22,"Parlophone (UK), Capitol (CAN), Odeon (FRA)"
2,Introducing... The Beatles,,,,,,,2,1964-01-10,Vee-Jay (US)
3,Meet the Beatles!,,,1,,,,1,1964-01-20,Capitol (US)
4,Twist and Shout,,,1,,,,,1964-02-03,Capitol (CAN)
5,The Beatles' Second Album,,,1,,50,,1,1964-04-10,Capitol (US)
6,The Beatles' Long Tall Sally,,,1,,,,,1964-05-11,Capitol (CAN)
7,A Hard Day's Night,,,1,5,,,1,1964-06-26,United Artists (US)[C][D]
8,A Hard Day's Night,1,1,,,1,,,1964-07-10,Parlophone (UK)
9,Something New,,,2,,38,,2,1964-07-20,Capitol (US)


You can also use the `attrs` parameter to select a table from the page by its HTML attributes:

In [61]:
url = 'https://github.com/dajebbar/python-pract/blob/main/work/data_analysis/pandas_cookbook_1.x/data/amzn_stock.csv'
dfs = pd.read_html(url, attrs={"class": "csv-data"})
len(dfs)

1

In [62]:
dfs[0]

Unnamed: 0.1,Unnamed: 0,Date,Open,High,Low,Close,Volume
0,,2010-01-04,136.25,136.61,133.14,133.90,7600543
1,,2010-01-05,133.43,135.48,131.81,134.69,8856456
2,,2010-01-06,134.60,134.73,131.65,132.25,7180977
3,,2010-01-07,132.01,132.32,128.80,130.00,11030124
4,,2010-01-08,130.56,133.68,129.03,133.52,9833829
...,...,...,...,...,...,...,...
1891,,2017-07-11,993.00,995.99,983.72,994.13,2982726
1892,,2017-07-12,1000.65,1008.55,998.10,1006.51,3608574
1893,,2017-07-13,1004.62,1006.88,995.90,1000.63,2880769
1894,,2017-07-14,1002.40,1004.45,996.89,1001.81,2102469
