# Working with many and/or large files

In this section, we will take a look at techniques for working with many files, as well as large files.

In [1]:
import pandas as pd
from dfply import *

## Baseball data

We will be using the [Baseball Databank](https://github.com/chadwickbureau/baseballdatabank). Make sure you have cloned the data into `./data/baseball`.

In [2]:
!git clone https://github.com/chadwickbureau/baseballdatabank.git ./data/baseball

fatal: destination path './data/baseball' already exists and is not an empty directory.


## Working with many files.

* Use `glob.glob` to find all files that match a pattern
* Convert all files to `pd.DataFrames`
* Store the `df` in a list or dictionary

## What the heck is a `glob`?

`glob.glob`

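* Takes a path pattern with shell-style wildcards (`*`, `?`, `[...]`), not a regular expression
* Returns a list of the paths that match the pattern
* Accepts relative paths, resolved against the current working directory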
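To make the wildcard behavior concrete, here is a minimal sketch that runs against a throwaway directory; the file names are made up for illustration. The order in which `glob` returns paths depends on the file system, so the sketch sorts the results.

```python
import os
import tempfile
from glob import glob

# Build a scratch directory with a few hypothetical files.
tmp = tempfile.mkdtemp()
for name in ['a1.csv', 'a2.csv', 'b1.csv', 'notes.txt']:
    open(os.path.join(tmp, name), 'w').close()

# '*' matches any run of characters within a file name.
csv_files = sorted(glob(os.path.join(tmp, '*.csv')))

# '?' matches exactly one character.
a_files = sorted(glob(os.path.join(tmp, 'a?.csv')))

print([os.path.basename(p) for p in csv_files])
print([os.path.basename(p) for p in a_files])
```

Note that `notes.txt` is skipped by both patterns because it does not end in `.csv`.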

## Store in `dict` or `list`?

* Natural sequence/order? $\rightarrow$ `list`
    * Example: Lakes data, where the years form a natural sequence
* Easier to refer by name? $\rightarrow$ `dict`
    * Example: Baseball files, which have no natural order and are easier to refer to by name
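The choice can be illustrated with a small sketch. The tables below are tiny, made-up stand-ins (not the actual lakes or baseball data), built in memory so the example runs on its own.

```python
import pandas as pd

# Hypothetical yearly tables: the years form a natural sequence, so a list fits.
years = [2001, 2002, 2003]
yearly = [pd.DataFrame({'year': [y], 'depth': [10 + i]})
          for i, y in enumerate(years)]

# A list of frames is easy to stack in order.
combined = pd.concat(yearly, ignore_index=True)

# Hypothetical named tables: no inherent order, so look them up by name.
tables = {
    'People': pd.DataFrame({'playerID': ['a01', 'b02']}),
    'Batting': pd.DataFrame({'playerID': ['a01'], 'HR': [3]}),
}
batting = tables['Batting']

print(combined)
print(batting)
```

With a `list`, position carries the meaning (`yearly[0]` is the first year); with a `dict`, the key does (`tables['Batting']`).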

## Example 1 - Reading the baseball database.

#### Step 1 - Get the file names

In [2]:
from glob import glob
files = glob('./data/baseball/core/*.csv')
files

['./data/baseball/core/AllstarFull.csv',
 './data/baseball/core/Appearances.csv',
 './data/baseball/core/AwardsManagers.csv',
 './data/baseball/core/AwardsPlayers.csv',
 './data/baseball/core/AwardsShareManagers.csv',
 './data/baseball/core/AwardsSharePlayers.csv',
 './data/baseball/core/Batting.csv',
 './data/baseball/core/BattingPost.csv',
 './data/baseball/core/CollegePlaying.csv',
 './data/baseball/core/Fielding.csv',
 './data/baseball/core/FieldingOF.csv',
 './data/baseball/core/FieldingOFsplit.csv',
 './data/baseball/core/FieldingPost.csv',
 './data/baseball/core/HallOfFame.csv',
 './data/baseball/core/HomeGames.csv',
 './data/baseball/core/Managers.csv',
 './data/baseball/core/ManagersHalf.csv',
 './data/baseball/core/Parks.csv',
 './data/baseball/core/People.csv',
 './data/baseball/core/Pitching.csv',
 './data/baseball/core/PitchingPost.csv',
 './data/baseball/core/Salaries.csv',
 './data/baseball/core/Schools.csv',
 './data/baseball/core/SeriesPost.csv',
 './data/baseball/core

#### Step 2 - Make helper functions to get the name from path

In [3]:
import re
FILE_NAME_RE = re.compile(r'^\./data/baseball/core/([a-zA-Z_]*)\.csv$')
file_name = lambda p: FILE_NAME_RE.match(p).group(1) 
file_names = lambda files: [file_name(p) for p in files]
file_names(files)[:2]

['AllstarFull', 'Appearances']
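As an alternative sketch, the standard library's `pathlib` can extract the name without a regular expression: `Path(p).stem` is the final path component with its last suffix removed. The helper name `stem_names` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def stem_names(paths):
    """Return the file name without directory or extension for each path."""
    return [Path(p).stem for p in paths]

example_paths = ['./data/baseball/core/AllstarFull.csv',
                 './data/baseball/core/Appearances.csv']
print(stem_names(example_paths))  # ['AllstarFull', 'Appearances']
```

Unlike the regular expression, this version does not check the directory or the extension, so it is more permissive about which paths it accepts.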

#### Step 3 - Use a comprehension to read in all files

**Note:** The data is small (< 10 MB total), so it is safe to read all of the files at once.

In [4]:
dfs = {name:pd.read_csv(path) for name, path in zip(file_names(files), files)}
dfs['Pitching'].head()

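| | playerID | yearID | stint | teamID | lgID | W | L | G | GS | CG | ... | IBB | WP | HBP | BK | BFP | GF | R | SH | SF | GIDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bechtge01 | 1871 | 1 | PH1 | NaN | 1 | 2 | 3 | 3 | 2 | ... | NaN | 7 | NaN | 0 | 146.0 | 0 | 42 | NaN | NaN | NaN |
| 1 | brainas01 | 1871 | 1 | WS3 | NaN | 12 | 15 | 30 | 30 | 30 | ... | NaN | 7 | NaN | 0 | 1291.0 | 0 | 292 | NaN | NaN | NaN |
| 2 | fergubo01 | 1871 | 1 | NY2 | NaN | 0 | 0 | 1 | 0 | 0 | ... | NaN | 2 | NaN | 0 | 14.0 | 0 | 9 | NaN | NaN | NaN |
| 3 | fishech01 | 1871 | 1 | RC1 | NaN | 4 | 16 | 24 | 24 | 22 | ... | NaN | 20 | NaN | 0 | 1080.0 | 1 | 257 | NaN | NaN | NaN |
| 4 | fleetfr01 | 1871 | 1 | NY2 | NaN | 0 | 1 | 1 | 1 | 1 | ... | NaN | 0 | NaN | 0 | 57.0 | 0 | 21 | NaN | NaN | NaN |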
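
The size claim in the note for this step can be checked before reading anything. The following is a rough sketch, assuming a hypothetical helper `total_size_mb`; it sums on-disk sizes with `os.path.getsize`, which only approximates the memory the resulting data frames will need.

```python
import os
import tempfile

def total_size_mb(paths):
    """Sum the on-disk size of the given files, in megabytes (10**6 bytes)."""
    return sum(os.path.getsize(p) for p in paths) / 1e6

# Demo on two scratch files of known size (hypothetical stand-ins for the CSVs).
tmp = tempfile.mkdtemp()
demo_paths = []
for name, n_bytes in [('small.csv', 1000), ('larger.csv', 2_000_000)]:
    path = os.path.join(tmp, name)
    with open(path, 'wb') as f:
        f.write(b'x' * n_bytes)
    demo_paths.append(path)

size = total_size_mb(demo_paths)
print(f'{size:.3f} MB')
```

For the baseball data, `total_size_mb(files)` would give the figure to compare against whatever limit you consider safe.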


## <font color="red"> Exercise 1 </font>

Use `glob` to read the following files into a `dict`: `Person.csv`, `Survey.csv`, `Site.csv`, `Visited.csv`

In [5]:
from glob import glob
files_wanted = glob('./data/*.csv')
files_wanted

['./data/Artists.csv',
 './data/Artworks.csv',
 './data/ebola_data_db_format.csv',
 './data/health_survey.csv',
 './data/heroes_information.csv',
 './data/OralFlag_example.csv',
 './data/Person.csv',
 './data/PEW_income_religion.csv',
 './data/Site.csv',
 './data/super_hero_powers.csv',
 './data/Survey.csv',
 './data/TB_bad.csv',
 './data/TB_burden_age_sex_2019-01-07.csv',
 './data/TB_example.csv',
 './data/uber-raw-data-apr14-small.csv',
 './data/uber-raw-data-apr14.csv',
 './data/Visited.csv']

In [6]:
s1 = './data/Artists.csv'
s2 = './data/Person.csv'
s3 = './data/PEW_income_religion.csv'
s4 = './data/Site.csv'
s5 = './data/Survey.csv'
s6 = './data/Visited.csv'
tester = r'\./data/(Person|Site|Survey|Visited)\.csv$'
rig = re.compile(tester)

In [7]:
assert not rig.search(s1), 'Failed 1'
assert not rig.search(s3), 'Failed 3'
assert rig.search(s2), 'Failed 2'
assert rig.search(s4), 'Failed 4'
assert rig.search(s5), 'Failed 5'
assert rig.search(s6), 'Failed 6'

In [8]:
import re
FILE_NAME_RE = re.compile(r'^\./data/(Person|Site|Survey|Visited)\.csv$')
file_name = lambda p: FILE_NAME_RE.search(p).group(1)
# search() returns None for paths that do not match, so filter those out
# before calling .group(1); otherwise this raises AttributeError.
file_names = lambda files: [file_name(p) for p in files if FILE_NAME_RE.search(p)]
file_names(files_wanted)[:2]

['Person', 'Site']
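Calling `.group(1)` on every path fails whenever `search` returns `None`, which happens for each file the pattern does not match. One way around this, sketched here as an assumption rather than the intended solution, is to skip the regular expression and keep only the paths whose stem is in a set of wanted names. The scratch directory stands in for `./data`, and the name `WANTED` is hypothetical.

```python
import os
import tempfile
from glob import glob
from pathlib import Path

# Scratch directory standing in for ./data (a few names from the listing above).
tmp = tempfile.mkdtemp()
for name in ['Artists.csv', 'Person.csv', 'Site.csv', 'Survey.csv', 'Visited.csv']:
    open(os.path.join(tmp, name), 'w').close()

WANTED = {'Person', 'Site', 'Survey', 'Visited'}

# Keep only the paths whose stem is one of the wanted names.
paths = {Path(p).stem: p
         for p in sorted(glob(os.path.join(tmp, '*.csv')))
         if Path(p).stem in WANTED}

print(sorted(paths))
```

From here, `{name: pd.read_csv(p) for name, p in paths.items()}` would build the `dict` of data frames, as in Example 1.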

## Up Next

In [Lecture 3.2 - Aggregating Large Files with Pandas](./3_2_aggregating_large_files_in_pandas.ipynb), we will look at using `pandas` to read and aggregate chunks of a large file.