---
# Creating and Persisting DataFrames
---

In [1]:
import numpy as np
import pandas as pd

Create parallel lists of data. Each of these lists will become a column in the
DataFrame, so they should all have the same length.

In [2]:
fname = ["Paul", "John", "Richard", "George"]
lname = ["McCartney", "Lennon", "Starkey", "Harrison"]
birth = [1942, 1940, 1940, 1943]
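
The equal-length requirement matters because pandas refuses to build a DataFrame from columns of different lengths. The following is a minimal sketch; `short_birth` is a hypothetical list, deliberately one value short, and is not part of the original data:

```python
import pandas as pd

fname = ["Paul", "John", "Richard", "George"]
short_birth = [1942, 1940, 1940]  # hypothetical: one value is missing

try:
    pd.DataFrame({"first": fname, "birth": short_birth})
    error = None
except ValueError as exc:
    # pandas raises ValueError when the column lengths disagree
    error = exc

print(type(error).__name__)
```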

Create a dictionary from the lists, mapping the column name to the list:

In [3]:
people = dict(first=fname, last=lname, birth=birth)

Create a DataFrame from the dictionary

In [4]:
beatles = pd.DataFrame(people)

In [5]:
beatles

Unnamed: 0,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943


In [None]:
beatles.index

RangeIndex(start=0, stop=4, step=1)

In [None]:
# change index
pd.DataFrame(data=people, index=list('abcd'))

Unnamed: 0,first,last,birth
a,Paul,McCartney,1942
b,John,Lennon,1940
c,Richard,Starkey,1940
d,George,Harrison,1943


## Writing CSV

Write the DataFrame to a CSV file:

In [None]:
beatles

Unnamed: 0,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943


In [6]:
from io import StringIO

In [7]:
beatles_file = StringIO()
beatles.to_csv(beatles_file)

Look at the file contents:

In [8]:
print(beatles_file.getvalue())

,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943



In [9]:
_ = beatles_file.seek(0)
pd.read_csv(beatles_file)

Unnamed: 0.1,Unnamed: 0,first,last,birth
0,0,Paul,McCartney,1942
1,1,John,Lennon,1940
2,2,Richard,Starkey,1940
3,3,George,Harrison,1943


The `read_csv` function has an `index_col` parameter that you can use to specify the
location of the index:

In [10]:
_ = beatles_file.seek(0)
pd.read_csv(beatles_file, index_col=0)

Unnamed: 0,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943


Alternatively, if we don't want to include the index when writing the CSV file, we can set the
`index` parameter to `False`:

In [11]:
beatles_file = StringIO()
beatles.to_csv(beatles_file, index=False)

In [12]:
_ = beatles_file.seek(0)
pd.read_csv(beatles_file)

Unnamed: 0,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943
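
The two ways of handling the index can be compared side by side. This is a sketch under the assumption that an in-memory `StringIO` buffer stands in for a file on disk; `roundtrip` is a hypothetical helper introduced only for this illustration:

```python
from io import StringIO

import pandas as pd

beatles = pd.DataFrame({
    "first": ["Paul", "John", "Richard", "George"],
    "last": ["McCartney", "Lennon", "Starkey", "Harrison"],
    "birth": [1942, 1940, 1940, 1943],
})


def roundtrip(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and read it straight back."""
    buffer = StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)  # rewind so read_csv starts at the beginning
    return pd.read_csv(buffer)


with_index = roundtrip(beatles)                  # index written as an unnamed column
without_index = roundtrip(beatles, index=False)  # index left out of the file

print(list(with_index.columns))
print(list(without_index.columns))
```

Writing with `index=False` is usually the simpler choice when the index is just the default `RangeIndex`, since it carries no information worth storing.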


## Reading large CSV files

The pandas library is an in-memory tool: your data has to fit in memory before you can work with it. If you come across a large CSV file that you want to process, you have a few options. If you can process it a portion at a time, you can read it in chunks and process each chunk. Alternatively, if you expect to have enough memory to load the file, a few techniques can help pare down the size of the resulting DataFrame.  
Note that, as a rule of thumb, you should have three to ten times as much memory as the size of the DataFrame that you want to manipulate. The extra memory gives you room to perform many of the common operations.
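
One rough way to judge whether a file will fit is to measure a sample and extrapolate. The sketch below uses synthetic data in place of a real file, and `total_rows` is a hypothetical row count; the estimate is only as good as the sample is representative:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the first 1,000 rows of a large file
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "price": rng.integers(300, 3000, size=1000),
    "cut": rng.choice(["Ideal", "Premium", "Good"], size=1000),
})

total_rows = 50_000  # hypothetical number of rows in the full file

# deep=True also counts the memory held by Python string objects
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
estimated_mb = bytes_per_row * total_rows / 1e6

print(f"~{estimated_mb:.1f} MB estimated for {total_rows} rows")
```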

In [None]:
diamonds = pd.read_csv('./diamonds.csv', nrows=1000)
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


Use the `.info` method to see how much memory the sample of data uses:

In [None]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    1000 non-null   float64
 1   cut      1000 non-null   object 
 2   color    1000 non-null   object 
 3   clarity  1000 non-null   object 
 4   depth    1000 non-null   float64
 5   table    1000 non-null   float64
 6   price    1000 non-null   int64  
 7   x        1000 non-null   float64
 8   y        1000 non-null   float64
 9   z        1000 non-null   float64
dtypes: float64(6), int64(1), object(3)
memory usage: 78.2+ KB


Use the `dtype` parameter to `read_csv` to tell it to use the correct (or smaller) numeric types

In [None]:
diamonds2 = pd.read_csv('./diamonds.csv', nrows=1000, dtype={
    'carat': np.float32,
    'depth': np.float32,
    'table': np.float32,
    "x": np.float32,
    "y": np.float32,
    "z": np.float32,
    "price": np.int16,
})

diamonds2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    1000 non-null   float32
 1   cut      1000 non-null   object 
 2   color    1000 non-null   object 
 3   clarity  1000 non-null   object 
 4   depth    1000 non-null   float32
 5   table    1000 non-null   float32
 6   price    1000 non-null   int16  
 7   x        1000 non-null   float32
 8   y        1000 non-null   float32
 9   z        1000 non-null   float32
dtypes: float32(6), int16(1), object(3)
memory usage: 49.0+ KB
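
When it is not obvious which integer type is small enough, `pd.to_numeric` with its `downcast` parameter can choose one after the data is loaded. This is a sketch on a handful of prices copied from the sample above, not a replacement for passing `dtype` to `read_csv`:

```python
import numpy as np
import pandas as pd

prices = pd.Series([326, 2777, 2818, 2898], name="price")

# downcast="integer" picks the smallest signed integer type that holds every value
small = pd.to_numeric(prices, downcast="integer")

print(small.dtype)
print(np.iinfo(small.dtype))  # the range the chosen type can represent
```

Keep in mind that a type chosen from a sample may be too small for values that appear later in the file.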


Make sure that the summary statistics of the new dataset are similar to those of the original:


In [None]:
diamonds.describe().equals(diamonds2.describe())

False
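
The exact comparison fails because `float32` cannot represent the same values as `float64` bit for bit. A tolerance-based check such as `np.allclose` is one way to confirm that the statistics are close enough. The sketch below uses a small synthetic column and an assumed relative tolerance of `1e-5`:

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame({"x": [0.23, 0.21, 0.29, 0.31, 61.5, 59.8]})  # float64
narrow = wide.astype(np.float32)

# Exact equality is sensitive to the rounding introduced by float32
exact = wide.describe().equals(narrow.describe())

# Comparing within a relative tolerance ignores that rounding
close = np.allclose(
    wide.describe().to_numpy(),
    narrow.describe().to_numpy(),
    rtol=1e-5,
)

print(exact, close)
```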

In [None]:
diamonds.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
carat,1000.0,0.68928,0.195291,0.2,0.7,0.71,0.79,1.27
depth,1000.0,61.7228,1.758879,53.0,60.9,61.8,62.6,69.5
table,1000.0,57.7347,2.467946,52.0,56.0,57.0,59.0,70.0
price,1000.0,2476.54,839.57562,326.0,2777.0,2818.0,2856.0,2898.0
x,1000.0,5.60594,0.625173,3.79,5.64,5.77,5.92,7.12
y,1000.0,5.59918,0.611974,3.75,5.63,5.76,5.91,7.05
z,1000.0,3.45753,0.389819,2.27,3.45,3.55,3.64,4.33


In [None]:
diamonds2.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
carat,1000.0,0.689281,0.195291,0.2,0.7,0.71,0.79,1.27
depth,1000.0,61.722824,1.758878,53.0,60.900002,61.799999,62.599998,69.5
table,1000.0,57.734699,2.467944,52.0,56.0,57.0,59.0,70.0
price,1000.0,2476.54,839.57562,326.0,2777.0,2818.0,2856.0,2898.0
x,1000.0,5.605941,0.625173,3.79,5.64,5.77,5.92,7.12
y,1000.0,5.59918,0.611972,3.75,5.63,5.76,5.91,7.05
z,1000.0,3.457533,0.389819,2.27,3.45,3.55,3.64,4.33


Use the `dtype` parameter to change object types to categoricals. First, inspect the output of the `.value_counts` method for each object column. If the columns have low cardinality, you can convert them to categorical columns to save even more memory:

In [None]:
diamonds2.select_dtypes(include='object').columns

Index(['cut', 'color', 'clarity'], dtype='object')

In [None]:
diamonds2.cut.value_counts()

Ideal        333
Premium      290
Very Good    226
Good          89
Fair          62
Name: cut, dtype: int64

In [None]:
diamonds2.color.value_counts()

E    240
F    226
G    139
D    129
H    125
I     95
J     46
Name: color, dtype: int64

In [None]:
diamonds2.clarity.value_counts()

SI1     306
VS2     218
VS1     159
SI2     154
VVS2     62
VVS1     58
I1       29
IF       14
Name: clarity, dtype: int64

Because these columns have low cardinality, we can convert them to categoricals and use
around 37% of the original memory:

In [None]:
diamonds3 = pd.read_csv('./diamonds.csv', nrows=1000,
                        dtype={
                                'carat': np.float32,
                                'depth': np.float32,
                                'table': np.float32,
                                "x": np.float32,
                                "y": np.float32,
                                "z": np.float32,
                                "price": np.int16,
                               "cut": "category",
                               "color": "category",
                               "clarity": "category",
                              },
                        )
diamonds3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    1000 non-null   float32 
 1   cut      1000 non-null   category
 2   color    1000 non-null   category
 3   clarity  1000 non-null   category
 4   depth    1000 non-null   float32 
 5   table    1000 non-null   float32 
 6   price    1000 non-null   int16   
 7   x        1000 non-null   float32 
 8   y        1000 non-null   float32 
 9   z        1000 non-null   float32 
dtypes: category(3), float32(6), int16(1)
memory usage: 29.4 KB
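
The saving from categoricals can be measured on a single column. This sketch builds a synthetic low-cardinality string column (the labels mirror the `cut` values, but the data is randomly generated) and compares deep memory usage before and after conversion:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
labels = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
cuts = pd.Series(rng.choice(labels, size=1000))

as_category = cuts.astype("category")

# deep=True counts the strings themselves, not just the references to them
string_bytes = cuts.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(cuts.nunique(), string_bytes, category_bytes)
```

A categorical stores each distinct label once plus a small integer code per row, which is why the benefit shrinks as the number of distinct values grows.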


If there are columns that we know we can ignore, we can use the `usecols`
parameter to specify the columns we want to load. Here, we will ignore columns `x`, `y`,
and `z`:

In [None]:
cols = ['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price']
diamonds4 = pd.read_csv('./diamonds.csv', nrows=1000,
                        dtype={
                                'carat': np.float32,
                                'depth': np.float32,
                                'table': np.float32,
                                "x": np.float32,
                                "y": np.float32,
                                "z": np.float32,
                                "price": np.int16,
                               "cut": "category",
                               "color": "category",
                               "clarity": "category",
                              },
                        usecols=cols,
                        )
diamonds4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    1000 non-null   float32 
 1   cut      1000 non-null   category
 2   color    1000 non-null   category
 3   clarity  1000 non-null   category
 4   depth    1000 non-null   float32 
 5   table    1000 non-null   float32 
 6   price    1000 non-null   int16   
dtypes: category(3), float32(3), int16(1)
memory usage: 17.7 KB


If the preceding steps are not sufficient to create a small enough DataFrame, you might still be in luck. If you can process the data a chunk at a time and do not need all of it in memory at once, you can use the `chunksize` parameter, which makes `read_csv` return an iterator of DataFrames:

In [None]:
diamonds_iter = pd.read_csv('./diamonds.csv', nrows=1000,
                        dtype={
                                'carat': np.float32,
                                'depth': np.float32,
                                'table': np.float32,
                                "x": np.float32,
                                "y": np.float32,
                                "z": np.float32,
                                "price": np.int16,
                               "cut": "category",
                               "color": "category",
                               "clarity": "category",
                              },
                        usecols=cols,
                        chunksize=200,
                        )

In [None]:
def process(df):
    # Placeholder for real work: report how many values the chunk holds
    return f"processed {df.size} items"

In [None]:
for chunk in diamonds_iter:
  process(chunk)

In [None]:
np.iinfo(np.int8)

iinfo(min=-128, max=127, dtype=int8)

In [None]:
np.finfo(np.float16)

finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)

In [None]:
diamonds.price.memory_usage()

8128

In [None]:
diamonds.price.memory_usage(index=False)

8000

In [None]:
diamonds.cut.memory_usage()

8128

In [None]:
diamonds.cut.memory_usage(deep=True)

63461

In [None]:
diamonds4.to_feather('/tmp/d.arr')

In [None]:
diamonds5 = pd.read_feather('/tmp/d.arr')

In [None]:
diamonds4.to_parquet('/tmp/d.pqt')

In [None]:
diamonds4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    1000 non-null   float32 
 1   cut      1000 non-null   category
 2   color    1000 non-null   category
 3   clarity  1000 non-null   category
 4   depth    1000 non-null   float32 
 5   table    1000 non-null   float32 
 6   price    1000 non-null   int16   
dtypes: category(3), float32(3), int16(1)
memory usage: 17.7 KB


## Using Excel files

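You may need to install `xlwt` or `openpyxl` to write XLS or XLSX files, respectively. Note that pandas 2.0 removed the `xlwt` engine, so writing `.xls` files only works on older versions.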

Create an Excel file using the `.to_excel` method

In [13]:
beatles.to_excel('beat.xls')

In [14]:
beatles.to_excel('beat.xlsx')

Read the Excel file with the `read_excel` function

In [16]:
beat2 = pd.read_excel('beat.xls', index_col=0)
beat2

Unnamed: 0,first,last,birth
0,Paul,McCartney,1942
1,John,Lennon,1940
2,Richard,Starkey,1940
3,George,Harrison,1943


Inspect data types of the file to check that Excel preserved the types:

In [17]:
beat2.dtypes

first    object
last     object
birth     int64
dtype: object

In [29]:
beat2['first'] = beat2['first'].astype('category')

In [30]:
beat2.dtypes

first    category
last       object
birth       int64
dtype: object

We can use pandas to write to a specific sheet of a spreadsheet. Pass a `sheet_name`
parameter to the `.to_excel` method to tell it the name of the sheet to create:

In [33]:
xl_writer = pd.ExcelWriter('beat2.xlsx')
beatles.to_excel(xl_writer, sheet_name='All')
beatles[beatles.birth <= 1942].to_excel(xl_writer, sheet_name='1940')
# The method must be called (note the parentheses) for the workbook to be
# written to disk; `.close()` saves the file and releases the handle
xl_writer.close()