
# Handling Large Files with Low RAM (Optional)

Solution
Instead of loading the entire dataframe at once with pd.read_csv, we can create a special TextFileReader object, which lets us read the file in chunks.

1. Use the chunksize argument for pd.read_csv to create a TextFileReader.
    - chunksize is the number of rows to load at once.
    - We will use 100,000 rows in our examples.
2. Use the .get_chunk() method to extract the first chunk of rows.
3. Work out your entire workflow for that file using just the first chunk (temp_df), and save the result to disk.
4. Combine the workflow into one loop over the entire TextFileReader.
5. Use glob to combine all of the chunk CSVs into one final file.

In [None]:
import pandas as pd
import numpy as np  # needed later for np.nan
## Use the chunksize argument for pd.read_csv to create a TextFileReader.
basics_url = 'https://datasets.imdbws.com/title.basics.tsv.gz'
df_reader = pd.read_csv(basics_url, sep='\t',
                        low_memory=False, chunksize=100_000)
df_reader

We now get a TextFileReader instead of a DataFrame.
The TextFileReader is designed to return one chunk at a time from the source file as a DataFrame, using the reader.get_chunk() method.
It keeps track of its position in the original file using the ._currow attribute. The leading underscore marks ._currow as a private attribute, so it may change between pandas versions.
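To see this behavior without downloading the IMDb file, here is a minimal sketch on a tiny in-memory CSV. The column names (`id`, `value`) and the chunk size of 4 are made up for illustration; only `chunksize`, `get_chunk()`, and iterating over the reader come from the text above.

```python
import io
import pandas as pd

# Build a tiny 10-row CSV in memory so the example needs no download.
# (Hypothetical data: the columns "id" and "value" are for illustration only.)
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

# Passing chunksize makes read_csv return a TextFileReader, not a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

# get_chunk() returns the next chunk as an ordinary DataFrame.
first_chunk = reader.get_chunk()

# Iterating over the reader yields the remaining chunks in order.
remaining_sizes = [len(chunk) for chunk in reader]

reader.close()
print(len(first_chunk), remaining_sizes)
```

The last chunk is simply shorter when the row count is not a multiple of the chunk size.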

In [None]:
## The first row number of the next chunk is stored under ._currow
df_reader._currow

In [None]:
# Use the .get_chunk() method to extract the first chunk of rows.
temp_df = df_reader.get_chunk()
temp_df

In [None]:
## checking the updated ._currow
df_reader._currow

In [None]:
## Replace "\N" with np.nan
temp_df.replace({'\\N':np.nan},inplace=True)
## Eliminate rows that are null for runtimeMinutes, genres, or startYear
temp_df = temp_df.dropna(subset=['runtimeMinutes','genres','startYear'])
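As a possible alternative to calling replace after loading, pd.read_csv accepts an na_values argument that marks given strings as missing while parsing. The sketch below assumes a small made-up TSV sample (the three rows are not real IMDb records) and is not the method used in this notebook.

```python
import io
import pandas as pd

# A tiny tab-separated sample that uses the "\N" marker for missing values.
# (Hypothetical rows, for illustration only.)
tsv_text = (
    "tconst\tstartYear\tgenres\n"
    "tt1\t2001\tDrama\n"
    "tt2\t\\N\tComedy\n"
    "tt3\t1999\t\\N\n"
)

# na_values tells the parser to treat the literal string \N as NaN while reading.
sample = pd.read_csv(io.StringIO(tsv_text), sep="\t", na_values="\\N")

# Rows with a missing startYear or genres are then dropped, as in the cell above.
sample = sample.dropna(subset=["startYear", "genres"])
print(sample)
```

A side effect of this approach is that a column such as startYear can be parsed as numeric directly, because the marker never reaches the dataframe as a string.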

In [None]:
### Convert startYear to numeric for slicing
temp_df['startYear'] = temp_df['startYear'].astype(float)
## Keep startYear from 2000 through 2021 (the < 2022 filter excludes 2022)
temp_df = temp_df[(temp_df['startYear']>=2000)&(temp_df['startYear']<2022)]
temp_df

In [None]:
## Make the Data folder if it doesn't already exist
import os
os.makedirs('Data',exist_ok=True)

In [None]:
## Programmatically build a filename (fname) using the chunk number
chunk_num=1
fname= f'Data/title_basics_chunk_{chunk_num:03d}.csv.gz'
fname

In [None]:
## Save temp_df to disk using the fname.
temp_df.to_csv(fname, compression='gzip')
## incrementing chunk_num by 1 for the next file.
chunk_num+=1

In [None]:
## Without index_col, the saved index comes back as an extra "Unnamed: 0" column
pd.read_csv(fname)

In [None]:
## index_col=0 uses the first column as the index again
pd.read_csv(fname, index_col=0)
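The two reads above differ because to_csv writes the index as an unnamed first column by default. A minimal sketch of the other option, saving with index=False so no index_col is needed on reload, is shown here. The two-row dataframe and the temporary file name are hypothetical stand-ins for temp_df and fname.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for temp_df.
df = pd.DataFrame({"tconst": ["tt1", "tt2"], "startYear": [2001.0, 2005.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    fname = os.path.join(tmp_dir, "demo_chunk_001.csv.gz")

    # Default: the index is written as an unnamed first column.
    df.to_csv(fname, compression="gzip")
    with_index = pd.read_csv(fname)

    # index=False skips the index, so the reload has only the data columns.
    df.to_csv(fname, compression="gzip", index=False)
    without_index = pd.read_csv(fname)

print(list(with_index.columns))
print(list(without_index.columns))
```

Keeping the index (as this notebook does) preserves each row's position in the original file, which can be useful when tracing a row back to its source.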

In [None]:
## title basics: re-create the reader so the loop starts from the top of the file
basics_url = 'https://datasets.imdbws.com/title.basics.tsv.gz'
chunk_num = 1
df_reader = pd.read_csv(basics_url, sep='\t',
                        low_memory=False, chunksize=100_000)

In [None]:
for temp_df in df_reader:
    #### COMBINED WORKFLOW FROM ABOVE
    ## Replace "\N" with np.nan
    temp_df.replace({'\\N':np.nan},inplace=True)
    ## Eliminate rows that are null for runtimeMinutes, genres, or startYear
    temp_df = temp_df.dropna(subset=['runtimeMinutes','genres','startYear'])

    ## NOTE: THERE ARE ADDITIONAL REQUIRED FILTERING STEPS FOR THE PROJECT NOT SHOWN HERE
    ### Convert startYear to numeric for slicing
    temp_df['startYear'] = temp_df['startYear'].astype(float)
    ## Keep startYear from 2000 through 2021 (the < 2022 filter excludes 2022)
    temp_df = temp_df[(temp_df['startYear']>=2000)&(temp_df['startYear']<2022)]

    ### Saving chunk to disk
    fname= f'Data/title_basics_chunk_{chunk_num:03d}.csv.gz'
    temp_df.to_csv(fname, compression='gzip')
    print(f"- Saved {fname}")

    ## increment chunk_num
    chunk_num+=1
## Closing the reader now that we are done looping through the file
df_reader.close()
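Step 5 from the list at the top (combining the chunk files with glob) is not shown in the cells here. The following is a minimal sketch of one way to do it, not necessarily the author's version. The helper name `combine_chunks` and the output name `title_basics_combined.csv.gz` are hypothetical; `df_combined` and `final_fname` mirror the names used in the appendix. The demo writes two small stand-in chunks to a temporary folder so it runs on its own.

```python
import glob
import os
import tempfile
import pandas as pd


def combine_chunks(pattern, final_fname):
    """Concatenate every chunk file matching pattern and save one combined file."""
    # sorted() keeps chunk_001, chunk_002, ... in order; glob does not promise an order.
    chunk_files = sorted(glob.glob(pattern))
    # index_col=0 restores the index that to_csv wrote as the first column.
    df_list = [pd.read_csv(f, index_col=0) for f in chunk_files]
    df_combined = pd.concat(df_list)
    df_combined.to_csv(final_fname, compression="gzip")
    return df_combined


# Demo on two small stand-in chunks (hypothetical data) in a temporary folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    for chunk_num, rows in enumerate([["tt1", "tt2"], ["tt3"]], start=1):
        fname = os.path.join(tmp_dir, f"title_basics_chunk_{chunk_num:03d}.csv.gz")
        pd.DataFrame({"tconst": rows}).to_csv(fname, compression="gzip")

    final_fname = os.path.join(tmp_dir, "title_basics_combined.csv.gz")
    df_combined = combine_chunks(
        os.path.join(tmp_dir, "title_basics_chunk_*.csv.gz"), final_fname
    )
    final_exists = os.path.exists(final_fname)

print(df_combined["tconst"].tolist(), final_exists)
```

In this notebook the pattern would be `'Data/title_basics_chunk_*.csv.gz'`. The combined file name is chosen so that it does not match the chunk pattern, which keeps a rerun from reading the final file back in as if it were a chunk.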

APPENDIX
Bonus functions for getting the size of dataframes and files


In [None]:
import os
def get_memory_usage(df,units='mb'):
    """Prints the memory size of a dataframe in the requested units ('mb' or 'gb')"""
    memory = df.memory_usage().sum()

    if units.lower()=='mb':
        denom = 1e6
    elif units.lower()=='gb':
        denom = 1e9
    else:
        raise Exception('Units must be either "mb" or "gb"')
    val = memory/denom
    print(f"- Total Memory Usage = {val} {units.upper()}")

In [None]:
## df_combined is the combined dataframe from step 5 (all chunk files concatenated)
get_memory_usage(df_combined)
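One caveat: by default, memory_usage() counts object (string) columns by their pointers only, so the figure printed above can understate the true footprint. The sketch below compares the default with deep=True on a made-up genres column. The data is hypothetical, and the column is forced to object dtype so the comparison behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical string column, stored explicitly as Python objects.
genres = pd.Series(["Drama", "Comedy,Romance", "Documentary"] * 1000, dtype=object)
df = pd.DataFrame({"genres": genres})

# Default: object columns are measured by their pointers only.
shallow_bytes = df.memory_usage().sum()

# deep=True also measures the Python string objects the pointers refer to.
deep_bytes = df.memory_usage(deep=True).sum()

print(f"shallow = {shallow_bytes / 1e6:.3f} MB, deep = {deep_bytes / 1e6:.3f} MB")
```

If the deep figure matters for your RAM budget, passing deep=True inside get_memory_usage is a small change, at the cost of a slower measurement on large dataframes.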


In [None]:

import os

def get_filesize(fname, units='mb'):
    """Get size of file at given path in MB or GB"""
    if units.lower()=='mb':
        denom = 1e6
    elif units.lower()=='gb':
        denom = 1e9
    else:
        raise Exception('Units must be either "mb" or "gb"')

    size = os.path.getsize(fname)

    val = size/denom
    print(f"- {fname} is {val} {units.upper()} on disk.")

In [None]:
## final_fname is the path of the combined file saved in step 5
get_filesize(final_fname)
