
In [3]:
import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
data.sample(10)

               date   county      state     fips  cases  deaths
110628   2020-05-03     Todd   Kentucky  21219.0     12     0.0
33973    2020-04-05    Boyle   Kentucky  21021.0      8     0.0
1332058  2021-05-18  Johnson  Tennessee  47091.0   2424    39.0
210467   2020-06-05  Hubbard  Minnesota  27057.0      3     0.0
1069736  2021-02-27   Blount    Alabama   1009.0   6095   127.0
1388712  2021-06-05    Wells    Indiana  18179.0   2968    83.0
1040921  2021-02-18     Bibb    Georgia  13021.0  14352   363.0
1347895  2021-05-23    Ellis   Oklahoma  40045.0    357     5.0
334251   2020-07-14   Waller      Texas  48473.0    257     0.0
149637   2020-05-16   Benton   Missouri  29015.0      8     0.0


In [4]:
data.info(verbose=False, memory_usage="deep")
#Loading the entire dataset takes 379 MB of memory!

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1806867 entries, 0 to 1806866
Columns: 6 entries, date to deaths
dtypes: float64(2), int64(1), object(3)
memory usage: 379.7 MB
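
To see where a figure like this comes from, it can help to break memory usage down per column. The sketch below uses a tiny made-up frame (the `toy` name and its three rows are assumptions for illustration, not the real dataset) and forces the text column to `object`, the dtype reported in the output above.

```python
import pandas as pd

# Tiny hypothetical frame that mimics the mix of dtypes in the real dataset.
toy = pd.DataFrame({
    "county": pd.Series(["Todd", "Boyle", "Johnson"], dtype="object"),  # Python strings
    "cases": pd.Series([12, 8, 2424], dtype="int64"),
    "deaths": pd.Series([0.0, 0.0, 39.0], dtype="float64"),
})

# deep=True counts the string objects themselves, not only the 8-byte pointers to them.
per_column = toy.memory_usage(index=False, deep=True)
print(per_column)
```

The numeric columns cost exactly 8 bytes per row, while the object column costs noticeably more per row because every value is a separate Python string.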


In [5]:
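# I only need two columns of this dataset: county and cases.
# Keeping just those two takes 124.4 MB, about 33% of the original 379.7 MB, which is a decrease of roughly 67%.
# Note that the full dataset is still loaded in `data` at this point; this cell only selects from it.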
df = data[["county","cases"]]
#df.head()
df.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1806867 entries, 0 to 1806866
Columns: 2 entries, county to cases
dtypes: int64(1), object(1)
memory usage: 124.4 MB
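
As a quick sanity check, the ratio between the two reported sizes can be computed directly. The two numbers are copied from the `info()` outputs above; the variable names are only for illustration.

```python
full_mb = 379.7      # memory reported by data.info() for all six columns
subset_mb = 124.4    # memory reported for the county and cases columns

remaining = subset_mb / full_mb   # fraction of the original still in use
saved = 1 - remaining             # fraction of memory saved

print(f"remaining: {remaining:.1%}, saved: {saved:.1%}")
```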


In [6]:
#I can use Pandas to load only the columns I need like this
csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
#Load only two columns
df_2col = pd.read_csv(csv , usecols=["county", "cases"])
df_2col.info(verbose=False, memory_usage="deep")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1806867 entries, 0 to 1806866
Columns: 2 entries, county to cases
dtypes: int64(1), object(1)
memory usage: 124.4 MB
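
The same `usecols` behaviour can be tried without downloading anything. This is a minimal sketch that assumes a small in-memory CSV with the same header as us-counties.csv; the two data rows are made up for the example.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV with the same header as us-counties.csv.
raw = io.StringIO(
    "date,county,state,fips,cases,deaths\n"
    "2020-05-03,Todd,Kentucky,21219,12,0\n"
    "2020-04-05,Boyle,Kentucky,21021,8,0\n"
)

# Only the listed columns are kept; the others never reach the DataFrame.
small = pd.read_csv(raw, usecols=["county", "cases"])
print(list(small.columns))
```

Unlike selecting columns from an already loaded frame, this never builds the unwanted columns in memory.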


In [7]:
# Manipulate datatypes
# Another way to decrease the memory usage of our data is to store numeric columns in a narrower type. When pandas loads integers from a CSV into a data frame column, it stores them as int64 by default, which takes 64 bits (8 bytes) per value. If the values fit, we can use a smaller integer format to save memory.
# int8 can store integers from -128 to 127.
# int16 can store integers from -32768 to 32767.
# int32 can store integers from -2147483648 to 2147483647.
# int64 can store integers from -9223372036854775808 to 9223372036854775807.
# If you know that the numbers in a particular column will never be higher than 32767, you can use int16 and reduce the memory usage of that column by 75%. If they only fit in int32, the reduction is 50%.
df_2col["cases"].memory_usage(index=False, deep=True)

14454936
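
Rather than memorising the ranges, they can be looked up with `np.iinfo`, and `pd.to_numeric` with `downcast="integer"` can pick the narrowest signed type that still holds every value. The sketch below uses a made-up `cases` series (an assumption for the example) whose largest value does not fit in int16.

```python
import numpy as np
import pandas as pd

# Print the representable range of each signed integer width.
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(f"{info.dtype}: {info.min} to {info.max}")

# Hypothetical case counts; 40000 is above the int16 maximum of 32767.
cases = pd.Series([12, 8, 2424, 40000], dtype="int64")

# downcast="integer" chooses the smallest signed integer dtype that fits all values.
smallest = pd.to_numeric(cases, downcast="integer")
print(smallest.dtype)
```

Letting pandas choose the type this way avoids guessing an upper bound that the data might later exceed.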

In [8]:
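# So, assume that the number of cases in each county can't exceed 32767 (which is not true in real life). Then we can store that column as int16 instead of int64.
# Because the assumption does not hold for this dataset, int16 cannot represent the larger counts correctly; this cell is only meant to show the effect on memory.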
csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
#Load only two columns
df_2col = pd.read_csv(csv , usecols=["county", "cases"], dtype={"cases" : "int16"})
df_2col.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1806867 entries, 0 to 1806866
Columns: 2 entries, county to cases
dtypes: int16(1), object(1)
memory usage: 114.1 MB


In [9]:
df_2col["cases"].memory_usage(index=False, deep=True)
# Instead of 14454936 bytes, the column now takes 3613734 bytes.

3613734
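
Both byte counts follow directly from the number of rows times the width of one value. A small check, using only the row count reported by `info()`:

```python
import numpy as np

rows = 1_806_867  # number of entries reported by info()

bytes_int64 = rows * np.dtype("int64").itemsize  # 8 bytes per value
bytes_int16 = rows * np.dtype("int16").itemsize  # 2 bytes per value

print(bytes_int64, bytes_int16)
print(f"reduction: {1 - bytes_int16 / bytes_int64:.0%}")
```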

In [10]:
# Sparse columns
# If the data has one or more columns with lots of empty values stored as NaN, you can save memory by using a sparse column representation, so you won't waste memory storing all those empty values.
# Assume the county column has some NaN values and I don't want to spend memory on them. A sparse dtype stores only the non-NaN entries; it does not drop any rows.
csv = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
df_2col = pd.read_csv(csv , usecols=["county", "cases"])
# Note: astype is applied to the whole data frame here, so both county and cases become Sparse[str], and pd_series is a DataFrame despite its name.
pd_series = df_2col.astype("Sparse[str]")

In [12]:
pd_series.describe()

            county    cases
count      1806867  1806867
unique        1930    71470
top     Washington        1
freq         17581    26529


In [13]:
pd_series.info(verbose=False, memory_usage="deep")

KeyboardInterrupt: ignored
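
The cell above was interrupted, so it gives no measurement. Also, the `describe()` output shows a count equal to the number of rows for both columns, which suggests there are no NaN values here for a sparse representation to skip. To see the effect sparse storage can have, here is a minimal sketch on a made-up numeric column (an assumption for illustration, not part of the dataset) where 99% of the values are NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical column: one million floats, of which only 1% are not NaN.
values = np.full(1_000_000, np.nan)
values[::100] = 1.0

dense = pd.Series(values)
# A sparse float64 column stores only the non-NaN entries plus their positions.
sparse = dense.astype(pd.SparseDtype("float64", np.nan))

dense_bytes = dense.memory_usage(index=False, deep=True)
sparse_bytes = sparse.memory_usage(index=False, deep=True)
print(dense_bytes, sparse_bytes, sparse.sparse.density)
```

The saving depends entirely on how many entries equal the fill value; a column with no NaN values gains nothing from being sparse.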