### Loading Data

In [1]:
import numpy as np
import pandas as pd

#### Loading CSV Files

We saw that Python's `csv` module can load a CSV file into a list of lists.

Pandas offers similar functionality: it can load a CSV file directly into a `DataFrame`.

Let's look at the file `populations.csv`:

In [2]:
from csv import reader
with open('populations.csv') as f:
    data = reader(f)
    print(next(data))
    print(next(data))
    print(next(data))

['Geographic Area', 'July 1, 2001 Estimate', 'July 1, 2000 Estimate', 'April 1, 2000 Population Estimates Base']
['United States', '284796887', '282124631', '281421906']
['Alabama', '4464356', '4451493', '4447100']


Now let's use Pandas' built-in CSV reader:

In [3]:
df = pd.read_csv('populations.csv')

In [4]:
df

Unnamed: 0,Geographic Area,"July 1, 2001 Estimate","July 1, 2000 Estimate","April 1, 2000 Population Estimates Base"
0,United States,284796887,282124631,281421906
1,Alabama,4464356,4451493,4447100
2,Alaska,634892,627601,626932
3,Arizona,5307331,5165274,5130632
4,Arkansas,2692090,2678030,2673400
5,California,34501130,34000446,33871648
6,Colorado,4417714,4323410,4301261
7,Connecticut,3425074,3410079,3405565
8,Delaware,796165,786234,783600
9,District of Columbia,571822,571066,572059
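
To make the difference between the two loaders concrete, here is a minimal sketch. It assumes an in-memory `StringIO` buffer holding the first few rows shown above, rather than the real `populations.csv`, so that it runs anywhere. The names `SAMPLE`, `rows` and `sample_df` are illustrative only.

```python
from csv import reader
from io import StringIO

import pandas as pd

# First rows of the data shown above, held in memory instead of on disk
# (an assumption, so the example does not depend on populations.csv).
SAMPLE = '''Geographic Area,"July 1, 2001 Estimate","July 1, 2000 Estimate","April 1, 2000 Population Estimates Base"
United States,284796887,282124631,281421906
Alabama,4464356,4451493,4447100
Alaska,634892,627601,626932
'''

# csv.reader yields lists of strings; the numbers are not converted.
rows = list(reader(StringIO(SAMPLE)))
print(type(rows[1][1]))

# read_csv takes the column labels from the first row and
# converts the numeric columns to an integer dtype.
sample_df = pd.read_csv(StringIO(SAMPLE))
print(sample_df.dtypes)
```

With the `csv` module every value stays a string, whereas `read_csv` infers numeric types for us.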


First, we'll probably want to rename the column labels, since they are far too long.

In [5]:
df = df.rename(
    columns={
        'Geographic Area':'region', 
        'July 1, 2001 Estimate': '2001', 
        'July 1, 2000 Estimate': '2000', 
        'April 1, 2000 Population Estimates Base': 'not needed'
    }
)
df

Unnamed: 0,region,2001,2000,not needed
0,United States,284796887,282124631,281421906
1,Alabama,4464356,4451493,4447100
2,Alaska,634892,627601,626932
3,Arizona,5307331,5165274,5130632
4,Arkansas,2692090,2678030,2673400
5,California,34501130,34000446,33871648
6,Colorado,4417714,4323410,4301261
7,Connecticut,3425074,3410079,3405565
8,Delaware,796165,786234,783600
9,District of Columbia,571822,571066,572059


We'll want to drop that last column:

In [6]:
df = df.drop(columns=['not needed'])
df

Unnamed: 0,region,2001,2000
0,United States,284796887,282124631
1,Alabama,4464356,4451493
2,Alaska,634892,627601
3,Arizona,5307331,5165274
4,Arkansas,2692090,2678030
5,California,34501130,34000446
6,Colorado,4417714,4323410
7,Connecticut,3425074,3410079
8,Delaware,796165,786234
9,District of Columbia,571822,571066


And lastly, we may want to use the `region` column as the explicit index:

In [7]:
df = df.set_index('region')
df

region,2001,2000
United States,284796887,282124631
Alabama,4464356,4451493
Alaska,634892,627601
Arizona,5307331,5165274
Arkansas,2692090,2678030
California,34501130,34000446
Colorado,4417714,4323410
Connecticut,3425074,3410079
Delaware,796165,786234
District of Columbia,571822,571066
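
Since `rename`, `drop` and `set_index` each return a new `DataFrame`, the three steps above can also be written as a single method chain. The sketch below assumes a small in-memory sample (`SAMPLE`, a hypothetical name) in place of the real file.

```python
from io import StringIO

import pandas as pd

# A few rows of the data shown above (an in-memory stand-in for the file).
SAMPLE = '''Geographic Area,"July 1, 2001 Estimate","July 1, 2000 Estimate","April 1, 2000 Population Estimates Base"
United States,284796887,282124631,281421906
Alabama,4464356,4451493,4447100
Alaska,634892,627601,626932
'''

# Each method returns a new DataFrame, so the calls can be chained.
chained = (
    pd.read_csv(StringIO(SAMPLE))
    .rename(
        columns={
            'Geographic Area': 'region',
            'July 1, 2001 Estimate': '2001',
            'July 1, 2000 Estimate': '2000',
            'April 1, 2000 Population Estimates Base': 'not needed',
        }
    )
    .drop(columns=['not needed'])
    .set_index('region')
)
print(chained)
```

Whether to chain or to reassign `df` at each step is a matter of taste; the resulting data frame has the same labels and values either way.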


Let's look at the data frame info:

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 52 entries, United States to Wyoming
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   2001    52 non-null     int64
 1   2000    52 non-null     int64
dtypes: int64(2)
memory usage: 1.2+ KB


Instead of loading the CSV file and then manipulating the index and columns afterwards, we can do all of this in one step by passing additional arguments to the `read_csv` function.

In [9]:
df = pd.read_csv(
    'populations.csv', 
    header=0,  # the first row holds the file's own headers, which `names` replaces
    usecols=[0, 1, 2],  # positions of the columns to load into the DataFrame
    names=['region', '2001', '2000'],  # new labels for the loaded columns
    index_col=0,  # use the first loaded column (`region`) as the index
)
df

region,2001,2000
United States,284796887,282124631
Alabama,4464356,4451493
Alaska,634892,627601
Arizona,5307331,5165274
Arkansas,2692090,2678030
California,34501130,34000446
Colorado,4417714,4323410
Connecticut,3425074,3410079
Delaware,796165,786234
District of Columbia,571822,571066


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 52 entries, United States to Wyoming
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   2001    52 non-null     int64
 1   2000    52 non-null     int64
dtypes: int64(2)
memory usage: 1.2+ KB
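
As a quick sanity check, the sketch below compares the two approaches on a small in-memory sample (an assumption, standing in for `populations.csv`). The names `step_by_step` and `one_call` are illustrative.

```python
from io import StringIO

import pandas as pd

# A few rows of the data shown above (an in-memory stand-in for the file).
SAMPLE = '''Geographic Area,"July 1, 2001 Estimate","July 1, 2000 Estimate","April 1, 2000 Population Estimates Base"
United States,284796887,282124631,281421906
Alabama,4464356,4451493,4447100
Alaska,634892,627601,626932
'''

# Approach 1: load everything, then rename, drop and re-index.
step_by_step = pd.read_csv(StringIO(SAMPLE))
step_by_step = step_by_step.rename(
    columns={
        'Geographic Area': 'region',
        'July 1, 2001 Estimate': '2001',
        'July 1, 2000 Estimate': '2000',
        'April 1, 2000 Population Estimates Base': 'not needed',
    }
)
step_by_step = step_by_step.drop(columns=['not needed'])
step_by_step = step_by_step.set_index('region')

# Approach 2: let read_csv do the same work while loading.
one_call = pd.read_csv(
    StringIO(SAMPLE),
    header=0,
    usecols=[0, 1, 2],
    names=['region', '2001', '2000'],
    index_col=0,
)

print(one_call)
# DataFrame.equals compares values and dtypes of the two frames.
print(step_by_step.equals(one_call))
```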


Now that the data is structured this way, we can use universal functions and operators, just as we did in NumPy, to calculate the population growth from 2000 to 2001 as a percentage of the 2000 population.

In [11]:
df.loc[:, '2001']

region
United States           284796887
Alabama                   4464356
Alaska                     634892
Arizona                   5307331
Arkansas                  2692090
California               34501130
Colorado                  4417714
Connecticut               3425074
Delaware                   796165
District of Columbia       571822
Florida                  16396515
Georgia                   8383915
Hawaii                    1224398
Idaho                     1321006
Illinois                 12482301
Indiana                   6114745
Iowa                      2923179
Kansas                    2694641
Kentucky                  4065556
Louisiana                 4465430
Maine                     1286670
Maryland                  5375156
Massachusetts             6379304
Michigan                  9990817
Minnesota                 4972294
Mississippi               2858029
Missouri                  5629707
Montana                    904433
Nebraska                  1713235
...

Note that we could also use plain `[]`, since we are selecting by column label only, with no slicing.

In [12]:
df['2001']

region
United States           284796887
Alabama                   4464356
Alaska                     634892
Arizona                   5307331
Arkansas                  2692090
California               34501130
Colorado                  4417714
Connecticut               3425074
Delaware                   796165
District of Columbia       571822
Florida                  16396515
Georgia                   8383915
Hawaii                    1224398
Idaho                     1321006
Illinois                 12482301
Indiana                   6114745
Iowa                      2923179
Kansas                    2694641
Kentucky                  4065556
Louisiana                 4465430
Maine                     1286670
Maryland                  5375156
Massachusetts             6379304
Michigan                  9990817
Minnesota                 4972294
Mississippi               2858029
Missouri                  5629707
Montana                    904433
Nebraska                  1713235
...
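
The following minimal sketch, built on a hypothetical two-row `sample_df`, shows that both selection styles return the same `Series`, and that passing a list of labels to `[]` returns a `DataFrame` instead.

```python
import pandas as pd

# Two rows taken from the data above, built by hand for illustration.
sample_df = pd.DataFrame(
    {'2001': [4464356, 634892], '2000': [4451493, 627601]},
    index=pd.Index(['Alabama', 'Alaska'], name='region'),
)

by_loc = sample_df.loc[:, '2001']   # all rows, one column label
by_brackets = sample_df['2001']     # plain [] with a single column label

print(by_loc.equals(by_brackets))   # True

# A list of labels selects columns too, but returns a DataFrame.
subset = sample_df[['2001']]
print(type(by_brackets).__name__, type(subset).__name__)  # Series DataFrame
```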

We can find the population differences using the `-` operator (which, again, uses NumPy's subtraction universal function):

In [13]:
df['2001'] - df['2000']

region
United States           2672256
Alabama                   12863
Alaska                     7291
Arizona                  142057
Arkansas                  14060
California               500684
Colorado                  94304
Connecticut               14995
Delaware                   9931
District of Columbia        756
Florida                  342187
Georgia                  154092
Hawaii                    12117
Idaho                     21748
Illinois                  46331
Indiana                   24795
Iowa                      -4330
Kansas                     2891
Kentucky                  18132
Louisiana                 -4540
Maine                      9709
Maryland                  64248
Massachusetts             22232
Michigan                  38811
Minnesota                 41201
Mississippi                8929
Missouri                  26154
Montana                    1276
Nebraska                    658
Nevada                    87351
New Hampshire             19300
...
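
To illustrate the link with NumPy, the sketch below applies `np.subtract` directly to two columns of a small hand-built `sample_df` (a hypothetical stand-in for the full data) and compares the result with the `-` operator. In both cases the region labels are carried along.

```python
import numpy as np
import pandas as pd

# Three rows taken from the data above, built by hand for illustration.
sample_df = pd.DataFrame(
    {
        '2001': [4464356, 634892, 5307331],
        '2000': [4451493, 627601, 5165274],
    },
    index=pd.Index(['Alabama', 'Alaska', 'Arizona'], name='region'),
)

with_operator = sample_df['2001'] - sample_df['2000']
with_ufunc = np.subtract(sample_df['2001'], sample_df['2000'])

# The ufunc returns a Series with the same index, not a bare array.
print(with_ufunc)
print(with_operator.equals(with_ufunc))  # True
```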

We can then express that difference as a percentage of the `2000` estimate:

In [14]:
increase = 100 * (df['2001'] - df['2000']) / df['2000']
increase

region
United States           0.947190
Alabama                 0.288959
Alaska                  1.161725
Arizona                 2.750232
Arkansas                0.525013
California              1.472581
Colorado                2.181241
Connecticut             0.439726
Delaware                1.263110
District of Columbia    0.132384
Florida                 2.131431
Georgia                 1.872361
Hawaii                  0.999521
Idaho                   1.673878
Illinois                0.372556
Indiana                 0.407146
Iowa                   -0.147907
Kansas                  0.107402
Kentucky                0.447989
Louisiana              -0.101567
Maine                   0.760321
Maryland                1.209737
Massachusetts           0.349721
Michigan                0.389982
Minnesota               0.835535
Mississippi             0.313397
Missouri                0.466740
Montana                 0.141282
Nebraska                0.038422
Nevada                  4.327042
...
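
Because the result is an ordinary `Series`, the usual `Series` methods apply to it. As a sketch on a hand-built three-row sample (the full data is not reproduced here), we can sort the growth figures and look up the region with the largest increase.

```python
import pandas as pd

# Three rows taken from the data above, built by hand for illustration.
sample_df = pd.DataFrame(
    {
        '2001': [4464356, 634892, 5307331],
        '2000': [4451493, 627601, 5165274],
    },
    index=pd.Index(['Alabama', 'Alaska', 'Arizona'], name='region'),
)

# Growth from 2000 to 2001 as a percentage of the 2000 estimate.
increase = 100 * (sample_df['2001'] - sample_df['2000']) / sample_df['2000']

# Largest growth first.
print(increase.sort_values(ascending=False))

# Index label of the row with the largest growth in this sample.
print(increase.idxmax())  # Arizona
```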

#### Loading Excel Files

Pandas can also load data directly from an Excel spreadsheet.

Spreadsheets often have multiple tabs, so the `read_excel` function takes a `sheet_name` argument that specifies which tab to load, either by index (starting at `0`, which is the default) or by name (such as `Sheet1`).

Pandas needs an additional library to read Excel files. The `xlrd` library is one choice for legacy `.xls` files such as the one used here; for `.xlsx` files, `openpyxl` is the usual engine (there are others, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html).

You can simply `pip install xlrd` in your virtual environment (it should already be installed if you followed the installation instructions at the beginning of this course).

For this example I have a spreadsheet named `populations.xls` containing data in the first tab, named `data`.

In [15]:
df = pd.read_excel('populations.xls', sheet_name='data')
df

Unnamed: 0,Geographic Area,"July 1, 2001 Estimate","July 1, 2000 Estimate","April 1, 2000 Population Estimates Base"
0,United States,284796887,282124631,281421906
1,Alabama,4464356,4451493,4447100
2,Alaska,634892,627601,626932
3,Arizona,5307331,5165274,5130632
4,Arkansas,2692090,2678030,2673400
5,California,34501130,34000446,33871648
6,Colorado,4417714,4323410,4301261
7,Connecticut,3425074,3410079,3405565
8,Delaware,796165,786234,783600
9,District of Columbia,571822,571066,572059


We could manipulate this data like we did before, but we can also pass the same kind of arguments to the `read_excel` function instead:

In [16]:
df = pd.read_excel(
    'populations.xls',
    header=0,
    names=['region', '2001', '2000'],
    usecols=[0, 1, 2],
    index_col=0,    
)
df

region,2001,2000
United States,284796887,282124631
Alabama,4464356,4451493
Alaska,634892,627601
Arizona,5307331,5165274
Arkansas,2692090,2678030
California,34501130,34000446
Colorado,4417714,4323410
Connecticut,3425074,3410079
Delaware,796165,786234
District of Columbia,571822,571066


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 52 entries, United States to Wyoming
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   2001    52 non-null     int64
 1   2000    52 non-null     int64
dtypes: int64(2)
memory usage: 1.2+ KB


You can find more information about loading Excel files here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

Pandas also provides a wide range of other readers for loading data from a variety of sources, which you can see here:

https://pandas.pydata.org/pandas-docs/stable/reference/io.html
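
As one example of these other readers, the sketch below uses `read_json` on a small in-memory JSON document. The JSON layout and the name `JSON_SAMPLE` are assumptions made for illustration; the values are two rows from the population data.

```python
from io import StringIO

import pandas as pd

# A hypothetical JSON version of two rows of the population data.
JSON_SAMPLE = '''[
    {"region": "Alabama", "2001": 4464356, "2000": 4451493},
    {"region": "Alaska", "2001": 634892, "2000": 627601}
]'''

json_df = pd.read_json(
    StringIO(JSON_SAMPLE),
    orient='records',    # the document is a list of row objects
    convert_axes=False,  # keep the column labels exactly as written
).set_index('region')
print(json_df)
```

The pattern is the same as with `read_csv` and `read_excel`: the reader returns a `DataFrame`, which we can then index and manipulate as before.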