# Pandas and Alternatives

1. Import pandas aliased as 'pd' and NumPy aliased as 'np'.

In [1]:
import pandas as pd
import numpy as np

2. Create a DataFrame named 'df' by reading the file 'USCG.Search.Rescue.Stats.csv' with the pandas `read_csv()` function.

In [2]:
df = pd.read_csv('USCG.Search.Rescue.Stats.csv')

3. View the top 5 rows of data using the DataFrame `head()` method.

In [4]:
df.head()

Unnamed: 0,Fiscal Year,Cases,Responses,Sorties,Lives Saved,Lives Lost After CG Notification,Lives Lost Before CG Notification,Total,Lives Unaccounted For
1964,,41525,,2932,,,,,
1965,,38586,,1984,,,,,
1966,,43366,,2629,,,,,
1967,,42225,,3028,,,,,
1968,,46922,,2434,,,,,


4. View the last 5 rows by using the DataFrame `tail()` method.

In [5]:
df.tail()

Unnamed: 0,Fiscal Year,Cases,Responses,Sorties,Lives Saved,Lives Lost After CG Notification,Lives Lost Before CG Notification,Total,Lives Unaccounted For
2011,20512.0,43954,21566.0,3793,259.0,476.0,735.0,392.0,
2012,19787.0,43940,21609.0,4037,284.0,429.0,713.0,440.0,
2013,17803.0,38272,19420.0,3753,226.0,425.0,651.0,252.0,
2014,17508.0,38282,19032.0,3443,170.0,425.0,595.0,308.0,
2015,16456.0,37215,18781.0,3536,169.0,434.0,603.0,330.0,
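
The outputs above hint that the column labels may be shifted by one position: the years land in the index, the last column is empty, and in the 2015 row the values under 'Lives Saved' and 'Lives Lost After CG Notification' (169 and 434) add up to the value under 'Lives Lost Before CG Notification' (603) rather than to the one under 'Total'. One possible cause is a trailing comma on every data row, so that `read_csv()` sees one more field than there are header names and uses the first field as the index. This is an assumption; the source does not confirm it. A minimal sketch with made-up data (the column names and values below are hypothetical):

```python
import io
import pandas as pd

# Toy CSV (hypothetical values): 3 header names, but each data row ends with
# a trailing comma and therefore has 4 fields.
raw = "Year,Cases,Saved\n2001,10,1,\n2002,20,2,\n"

# Default behaviour: with one extra field per row, pandas uses the first
# field as the index, so each label ends up over its neighbour's data.
shifted = pd.read_csv(io.StringIO(raw))

# index_col=False tells pandas not to use the first column as the index.
# Newer pandas versions may emit a ParserWarning about the dropped field.
aligned = pd.read_csv(io.StringIO(raw), index_col=False)

print(shifted)
print(aligned)
```

If the real file behaves like the toy one, passing `index_col=False` to `read_csv()` would be one way to line the labels up with their data.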


5. View the values in the 'Cases' column using dot syntax, bracket syntax, `loc[]`, or `iloc[]`.

In [7]:
df.loc[:, 'Cases'].head()

1964    41525
1965    38586
1966    43366
1967    42225
1968    46922
Name: Cases, dtype: int64
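
The four access styles named in this step can be compared side by side. The sketch below uses a small stand-in DataFrame with made-up values (the name `toy` is hypothetical) so that it runs without the CSV file:

```python
import pandas as pd

# Small stand-in frame (made-up values); the year is used as the index label.
toy = pd.DataFrame(
    {"Cases": [10, 20, 30], "Sorties": [1, 2, 3]},
    index=[2001, 2002, 2003],
)

by_dot = toy.Cases            # attribute access; needs a name that is a valid identifier
by_bracket = toy["Cases"]     # works for any column name, including names with spaces
by_loc = toy.loc[:, "Cases"]  # label-based: all rows, column labelled "Cases"
by_iloc = toy.iloc[:, 0]      # position-based: all rows, first column

# All four expressions return the same Series.
print(by_dot.equals(by_bracket) and by_bracket.equals(by_loc) and by_loc.equals(by_iloc))
```

Column names that contain a space, such as 'Lives Saved', cannot be reached with dot syntax; bracket syntax or `loc[]` is needed for those.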

6. Use `describe()` to view the summary statistics for the DataFrame.

In [8]:
df.describe()

Unnamed: 0,Fiscal Year,Cases,Responses,Sorties,Lives Saved,Lives Lost After CG Notification,Lives Lost Before CG Notification,Total,Lives Unaccounted For
count,46.0,52.0,46.0,52.0,46.0,37.0,46.0,16.0,0.0
mean,46296.608696,58013.769231,67666.586957,4339.230769,670.956522,508.486486,1079.956522,468.0,
std,17438.646933,13480.714228,29300.537271,1334.134847,499.839128,134.761028,394.869765,149.916866,
min,16456.0,37215.0,18781.0,1984.0,169.0,180.0,533.0,252.0,
25%,31676.25,46632.75,33202.75,3348.5,281.75,425.0,751.0,336.75,
50%,50621.5,55945.5,81711.5,4221.0,383.5,492.0,998.0,437.5,
75%,57072.75,69049.75,88433.75,5484.5,1118.75,593.0,1440.75,584.25,
max,77954.0,86222.0,110267.0,7889.0,1783.0,800.0,1821.0,732.0,
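
The `count` row above ranges from 52 down to 0 because `describe()` ignores missing values when it computes each statistic. A small sketch with made-up data (the frame and column names are hypothetical) shows the same effect:

```python
import numpy as np
import pandas as pd

# Three float columns with 0, 2, and 4 missing values respectively.
toy = pd.DataFrame({
    "full": [1.0, 2.0, 3.0, 4.0],
    "gappy": [1.0, np.nan, 3.0, np.nan],
    "empty": [np.nan] * 4,
})

stats = toy.describe()

# count is the number of non-missing values in each column.
print(stats.loc["count"])

# The mean of "gappy" is taken over its two present values only.
print(stats.loc["mean", "gappy"])
```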


7. You can filter for particular values by comparing a column to a value inside the square brackets. This creates a boolean mask on the fly. Let's look at all of the rows whose case count is higher than the mean. You can get this number from the summary statistics above.

In [10]:
df[df.Cases > df.describe().loc['mean', 'Cases']].head(15)

Unnamed: 0,Fiscal Year,Cases,Responses,Sorties,Lives Saved,Lives Lost After CG Notification,Lives Lost Before CG Notification,Total,Lives Unaccounted For
1972,51539.0,60328,72306.0,2633,1389.0,,1389.0,,
1973,55107.0,64182,77209.0,2918,1474.0,,1474.0,,
1974,59335.0,67692,79950.0,2751,1509.0,,1509.0,,
1975,62334.0,70551,81561.0,3024,1254.0,,1254.0,,
1976,67179.0,75069,87807.0,2995,1112.0,,1112.0,,
1977,74637.0,82601,96021.0,4121,1458.0,,1458.0,,
1978,77954.0,86222,100262.0,4386,1556.0,,1556.0,,
1979,72517.0,79858,92117.0,5747,949.0,672.0,1621.0,,
1980,73345.0,81476,93726.0,6868,1235.0,586.0,1821.0,,
1981,71781.0,78951,91432.0,6339,1080.0,637.0,1717.0,,
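
To make the mask visible, the one-liner above can be split into separate steps. This is a sketch on made-up data (the names `toy`, `threshold`, `mask`, and `above` are hypothetical), and it takes the mean directly from the column, which gives the same number as the `describe()` lookup:

```python
import pandas as pd

toy = pd.DataFrame({"Cases": [10, 40, 20, 50]}, index=[2001, 2002, 2003, 2004])

threshold = toy["Cases"].mean()   # same value as toy.describe().loc["mean", "Cases"]
mask = toy["Cases"] > threshold   # boolean Series, one True/False per row
above = toy[mask]                 # keeps only the rows where the mask is True

print(mask.tolist())
print(above)
```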


9. Now let's create a NumPy array with the same data. Pandas DataFrames have a `to_numpy()` method. Use this method to create an array named 'np_array'.

In [11]:
np_array = df.to_numpy()

10. Call the shape attribute on the array.

In [12]:
np_array.shape

(52, 9)
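
The shape counts only the data: index and column labels are not part of the array. `to_numpy()` also picks one common dtype for all columns, which is why integer columns appear as floats in an array that also holds `NaN` values. A minimal sketch on made-up data (names are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ints": [1, 2, 3],             # int64 column
    "floats": [0.5, np.nan, 2.5],  # float64 column with a missing value
})

arr = toy.to_numpy()

print(arr.shape)  # (rows, columns); labels are dropped
print(arr.dtype)  # a single dtype that can hold both columns
```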

11. Use the array `reshape()` method to return a 4 x 13 x 9 array (these numbers are the arguments to the method).

In [13]:
np_array.reshape(4,13,9)

array([[[    nan,  41525.,     nan,   2932.,     nan,     nan,     nan,
             nan,     nan],
        [    nan,  38586.,     nan,   1984.,     nan,     nan,     nan,
             nan,     nan],
        [    nan,  43366.,     nan,   2629.,     nan,     nan,     nan,
             nan,     nan],
        [    nan,  42225.,     nan,   3028.,     nan,     nan,     nan,
             nan,     nan],
        [    nan,  46922.,     nan,   2434.,     nan,     nan,     nan,
             nan,     nan],
        [    nan,  48720.,     nan,   2050.,     nan,     nan,     nan,
             nan,     nan],
        [ 44975.,  52183.,  62286.,   4135.,   1783.,     nan,   1783.,
             nan,     nan],
        [ 48894.,  56181.,  68251.,   2423.,   1324.,     nan,   1324.,
             nan,     nan],
        [ 51539.,  60328.,  72306.,   2633.,   1389.,     nan,   1389.,
             nan,     nan],
        [ 55107.,  64182.,  77209.,   2918.,   1474.,     nan,   1474.,
             nan,     nan],
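
`reshape()` only succeeds when the product of the new dimensions equals the number of elements; here 4 x 13 x 9 = 468 = 52 x 9. The sketch below illustrates the rule on a smaller made-up array (the names are hypothetical):

```python
import numpy as np

flat = np.arange(24).reshape(6, 4)  # 6 rows x 4 columns = 24 elements

blocks = flat.reshape(2, 3, 4)      # 2 blocks of 3 rows each; 2 * 3 * 4 == 24
auto = flat.reshape(-1, 3, 4)       # -1 lets NumPy infer the first dimension

# Row-major order is preserved: block 1 holds original rows 3 to 5.
print(np.array_equal(blocks[1], flat[3:6]))

try:
    flat.reshape(5, 5)              # 25 elements requested, 24 available
except ValueError as err:
    print("reshape failed:", err)
```

Applied to the exercise, each of the 4 blocks holds 13 consecutive rows of the original 52.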


12. Import the dask.dataframe module aliased as 'dd'.

In [15]:
import dask.dataframe as dd

13. The `dask.dataframe` module has a `read_csv()` function that works in a similar fashion to the pandas one. Use it to read the file 'USCG.Search.Rescue.Stats.csv' into a dask DataFrame named 'ddf'.

In [16]:
ddf = dd.read_csv('USCG.Search.Rescue.Stats.csv')

14. Call the DataFrame's `std()` method.

In [17]:
ddf.std()

Dask Series Structure:
npartitions=1
Cases    float64
Total        ...
dtype: float64
Dask Name: dataframe-std, 9 tasks

15. Notice that this did not calculate the standard deviation, because dask uses lazy evaluation. Add `.compute()` after `std()` to compute the result.

In [18]:
ddf.std().compute()

Fiscal Year                          17438.646933
Cases                                13480.714228
Responses                            29300.537271
Sorties                               1334.134847
Lives Saved                            499.839128
Lives Lost After CG Notification       134.761028
Lives Lost Before CG Notification      394.869765
Total                                  149.916866
Lives Unaccounted For                         NaN
dtype: float64

For comparison, the same call on the pandas DataFrame returns identical values.

In [19]:
df.std()

Fiscal Year                          17438.646933
Cases                                13480.714228
Responses                            29300.537271
Sorties                               1334.134847
Lives Saved                            499.839128
Lives Lost After CG Notification       134.761028
Lives Lost Before CG Notification      394.869765
Total                                  149.916866
Lives Unaccounted For                         NaN
dtype: float64