# Basic Pandas Analysis

This notebook has 8 basic Pandas tasks for loading data and accessing rows and columns. Enter your solution to each task in the cell marked with the comment "# Solution". Then execute the solution cell, followed by the testing cell after it. The testing cell will raise an `AssertionError` if your solution is incorrect.

*Important: Make a copy of this notebook* and move the copy into your User directory on the Shared drive. Otherwise, your changes will appear here. You can also use the File menu to open this notebook in "Playground Mode", but then none of your changes will be saved.

If you are having trouble, see the solution notebook, named "Basic_Analysis_Solved.ipynb", or the previous demonstration notebooks: 

* [Python and Pandas Demo](https://colab.research.google.com/drive/1kERiaL3RuoimT3f6kinrBIc-DvjR1qXA?usp=sharing)
* [Basic POLA Analysis](https://colab.research.google.com/drive/1LfGLBr31F2TRILo_nSXNsYO0t3VRlqV8?usp=sharing)
    




# Task 1: Open a CSV file in a New Dataframe

* Import the Pandas module
* Load [this CSV file](http://library.metatab.org/sandiegodata.org-beachwatch-4/data/beachwatch.csv) using the [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function.
  * The file URL is: http://library.metatab.org/sandiegodata.org-beachwatch-4/data/beachwatch.csv
* Call the new dataframe `df`
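
As a minimal offline sketch of how `read_csv` behaves, the example below parses a tiny CSV from an in-memory string instead of a URL. The rows are a made-up sample with a few of the beachwatch column names, and the names `csv_text` and `sample` are chosen only for this illustration.

```python
import io
import pandas as pd

# Made-up sample rows (illustrative only, not the real file).
csv_text = """stationcode,sampledate,result
IB-080,2000-02-24 00:00:00,20.0
EH-010,1999-05-26 00:00:00,20.0
IB-080,2000-05-30 00:00:00,230.0
"""

# read_csv accepts a URL, a file path, or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (rows, columns)
print(list(sample.columns))  # column names taken from the header line
```

Passing the URL from the task in place of `io.StringIO(csv_text)` loads the full dataset in the same way.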


In [1]:
# Solution Here
import pandas as pd
df = pd.read_csv('http://library.metatab.org/sandiegodata.org-beachwatch-4/data/beachwatch.csv')

In [2]:
# Testing your solution; do not edit
from IPython.display import HTML
assert 'pd' in locals(), "Didn't find a module imported as pd"
assert str(type(pd)) == "<class 'module'>", 'Pandas module is not imported as pd'
assert type(df) == pd.DataFrame, 'df is not a DataFrame'
assert len(df) == 202257, f"df doesn't have as many records as it should have. It has {len(df)} and should have 202257"
HTML('<p style="color:green; font-size:30pt">Success!</p>')

# Task 2: Display the first 10 rows of the dataset 

* Use the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html?highlight=head#pandas.DataFrame.head) function

In [3]:
# Solution
df.head(10)

Unnamed: 0,stationcode,stationgroup,sampledate,collectiontime,measure_code,analyte,methodname,unit,result,result_group_count,result_group_std,result_group_mean,result_group_25pctl,result_group_median,result_group_75pctl,result_gt_median,result_gt_mean,result_lte_25pctl,result_gte_75pctl,lresult,lresult_group_std,lresult_group_mean,lresult_group_25pctl,lresult_group_median,lresult_group_75pctl,lresult_gt_lmedian,lresult_gt_lmean,lresult_lte_25pctl,lresult_gte_75pctl,labbatch,resultqualcode,qacode,sampleagency,labagency,submittingagency
0,EH-010,EH,1999-05-26 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-5/26/1999,<,NR,SDCDEH,SDCDEH,SDCDEH
1,EH-010,EH,1999-10-13 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,0.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,1.0,0.0,,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,0.0,SH-10/13/1999,=,NR,SDCDEH,SDCDEH,SDCDEH
2,EH-010,EH,1999-10-26 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-10/26/1999,<,NR,SDCDEH,SDCDEH,SDCDEH
3,EH-010,EH,2000-03-21 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-3/21/2000,<,NR,SDCDEH,SDCDEH,SDCDEH
4,EH-010,EH,2000-05-24 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,230.0,213,1647.469843,376.478873,20.0,20.0,20.0,1.0,0.0,0.0,1.0,5.438079,1.439884,3.633361,2.995732,2.995732,2.995732,1.0,1.0,0.0,1.0,SH-5/24/2000,=,NR,SDCDEH,SDCDEH,SDCDEH
5,EH-010,EH,2000-05-25 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-5/25/2000,=,NR,SDCDEH,SDCDEH,SDCDEH
6,EH-010,EH,2000-05-28 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-5/28/2000,<,NR,SDCDEH,SDCDEH,SDCDEH
7,EH-010,EH,2000-06-07 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-6/7/2000,<,NR,SDCDEH,SDCDEH,SDCDEH
8,EH-010,EH,2000-06-26 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-6/26/2000,<,NR,SDCDEH,SDCDEH,SDCDEH
9,EH-010,EH,2000-07-19 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-7/19/2000,<,NR,SDCDEH,SDCDEH,SDCDEH


In [4]:
# Test
assert type(_) == pd.DataFrame, "Last cell's output isn't a DataFrame"
assert len(_) == 10, f"Output should display 10 rows, got {len(_)} instead"
HTML('<p style="color:green; font-size:30pt">Success!</p>')

# Task 3: Display the first 5 rows, but transposed

* Use head() and T
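
The effect of chaining `head()` and `T` can be sketched on a small, made-up dataframe (the name `toy` and its columns are hypothetical). Transposing swaps the axes, so each original column becomes a row.

```python
import pandas as pd

# Hypothetical dataframe: 8 rows, 3 columns.
toy = pd.DataFrame({"a": range(8), "b": list("stuvwxyz"), "c": [0.5] * 8})

first_rows = toy.head(4)  # 4 rows x 3 columns
flipped = first_rows.T    # 3 rows (one per original column) x 4 columns

print(flipped.shape)
print(list(flipped.index))
```

This is why the length of a transposed dataframe equals the number of columns in the original, no matter how many rows `head()` returned.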

In [5]:
# Solution
df.head().T

Unnamed: 0,0,1,2,3,4
stationcode,EH-010,EH-010,EH-010,EH-010,EH-010
stationgroup,EH,EH,EH,EH,EH
sampledate,1999-05-26 00:00:00,1999-10-13 00:00:00,1999-10-26 00:00:00,2000-03-21 00:00:00,2000-05-24 00:00:00
collectiontime,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
measure_code,0,0,0,0,0
analyte,"Coliform, Fecal","Coliform, Fecal","Coliform, Fecal","Coliform, Fecal","Coliform, Fecal"
methodname,MTF,MTF,MTF,MTF,MTF
unit,MPN/100 mL,MPN/100 mL,MPN/100 mL,MPN/100 mL,MPN/100 mL
result,20,0,20,20,230
result_group_count,213,213,213,213,213


In [6]:
# Test
assert len(_) == 35, f"The transposed DataFrame should be 35 rows long, got {len(_)} instead"
HTML('<p style="color:green; font-size:30pt">Success!</p>')

# Task 4: Describe the dataframe

* Use the describe() function 
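
As a small sketch of what `describe()` returns, the made-up dataframe below (named `toy` for illustration) has one numeric and one text column. By default, only the numeric column is summarized.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column.
toy = pd.DataFrame({
    "result": [20.0, 0.0, 20.0, 230.0],
    "unit": ["MPN/100 mL"] * 4,
})

summary = toy.describe()  # summary statistics, one column per numeric input column

print(list(summary.index))
print(summary.loc["mean", "result"])
```

The result is itself a dataframe, so individual statistics can be read with `.loc`, as in the last line.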

In [7]:
# Solution
df.describe()

Unnamed: 0,measure_code,result,result_group_count,result_group_std,result_group_mean,result_group_25pctl,result_group_median,result_group_75pctl,result_gt_median,result_gt_mean,result_lte_25pctl,result_gte_75pctl,lresult,lresult_group_std,lresult_group_mean,lresult_group_25pctl,lresult_group_median,lresult_group_75pctl,lresult_gt_lmedian,lresult_gt_lmean,lresult_lte_25pctl,lresult_gte_75pctl
count,202257.0,198271.0,202257.0,201846.0,202222.0,202222.0,202222.0,202222.0,202257.0,202257.0,202257.0,202257.0,197910.0,201839.0,202216.0,202216.0,202216.0,202216.0,202257.0,202257.0,202257.0,202257.0
mean,12.996346,22699.6,317.78465,39280.77,23009.28,791.7807,6229.091,32921.37,0.327015,0.118295,0.043509,0.483207,2.877065,1.36258,2.891861,2.006074,2.433144,3.372824,0.326683,0.30745,0.042278,0.482629
std,9.650773,573147.6,220.945887,466989.2,339992.3,46347.59,169820.4,559376.4,0.469124,0.322958,0.204,0.499719,2.034466,0.661647,1.364789,1.271805,1.434257,1.815852,0.469002,0.461439,0.201223,0.499699
min,0.0,-10.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5.0,4.0,151.0,104.2986,27.14714,2.0,4.0,10.0,0.0,0.0,0.0,0.0,1.386294,0.910135,2.155533,0.693147,1.386294,2.302585,0.0,0.0,0.0,0.0
50%,11.0,20.0,260.0,408.0027,87.56272,10.0,10.0,20.0,0.0,0.0,0.0,0.0,2.995732,1.259063,2.966528,2.302585,2.302585,2.995732,0.0,0.0,0.0,0.0
75%,24.0,30.0,480.0,1617.303,317.5333,20.0,20.0,67.5,1.0,0.0,0.0,1.0,3.401197,1.709912,3.525028,2.995732,2.995732,4.2019,1.0,1.0,0.0,1.0
max,30.0,28000000.0,1081.0,12359000.0,14305670.0,9362500.0,17329000.0,24196000.0,1.0,1.0,1.0,1.0,17.147715,6.354908,15.939515,15.408507,16.667892,17.001698,1.0,1.0,1.0,1.0


In [8]:
# Test
assert 'mean' in _.index, "The last cell's output doesn't look like the output from describe()"
HTML('<p style="color:green; font-size:30pt">Success!</p>')

# Task 5: What is the most common station code?

* Use the value_counts() function
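
A minimal sketch of reading the most common value from `value_counts()`, using a short made-up Series (`codes` is a hypothetical name). The counts are sorted from most to least frequent, so the first entry is the most common value.

```python
import pandas as pd

# Hypothetical station codes.
codes = pd.Series(["IB-080", "EH-010", "IB-080", "FM-010", "IB-080", "EH-010"])

counts = codes.value_counts()  # sorted in descending order of frequency
most_common = counts.index[0]  # label of the most frequent value

print(most_common, counts.iloc[0])
```

`counts.idxmax()` gives the same label without relying on the sort order.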


In [9]:
# Solution 
df.stationcode.value_counts()

IB-080    5065
EH-420    4417
FM-010    4150
SE-060    3780
OC-100    3614
          ... 
EH-270       6
OC-035       3
EH-053       3
EH-312       3
EH-150       2
Name: stationcode, Length: 172, dtype: int64

In [10]:
# Test
assert _.iloc[0] == 5065, "Last cell's output doesn't look right...."
assert _.index[0] == 'IB-080', "Last cell's output doesn't look right...."
HTML('<p style="color:green; font-size:30pt">Success!</p>')

# Task 6: Create a dataframe of just the records from the most common station

* [Filter the dataframe](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#how-do-i-filter-specific-rows-from-a-dataframe) for the rows with the most common station code. 
* Call the dataframe 'ib'
* Display the head of the dataframe after assigning it. 
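
Row filtering can be sketched in two steps on a small, made-up dataframe (the names `toy`, `mask`, and `subset` are illustrative): build a boolean Series, then use it to index the dataframe.

```python
import pandas as pd

# Hypothetical data with two station codes.
toy = pd.DataFrame({
    "stationcode": ["IB-080", "EH-010", "IB-080"],
    "result": [20.0, 20.0, 230.0],
})

mask = toy["stationcode"] == "IB-080"  # boolean Series, one value per row
subset = toy[mask]                     # keeps only the rows where mask is True

print(mask.tolist())
print(len(subset))
```

The filtered dataframe keeps the original index labels, which is why the `ib` rows in this notebook do not start at 0.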

In [11]:
# Solution
ib = df[df.stationcode == 'IB-080']
ib.head()

Unnamed: 0,stationcode,stationgroup,sampledate,collectiontime,measure_code,analyte,methodname,unit,result,result_group_count,result_group_std,result_group_mean,result_group_25pctl,result_group_median,result_group_75pctl,result_gt_median,result_gt_mean,result_lte_25pctl,result_gte_75pctl,lresult,lresult_group_std,lresult_group_mean,lresult_group_25pctl,lresult_group_median,lresult_group_75pctl,lresult_gt_lmedian,lresult_gt_lmean,lresult_lte_25pctl,lresult_gte_75pctl,labbatch,resultqualcode,qacode,sampleagency,labagency,submittingagency
107412,IB-080,IB,2000-02-24 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,318,285.160391,47.295597,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,0.625001,3.161183,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-2/24/2000,=,NR,SDCDEH,SDCDEH,SDCDEH
107413,IB-080,IB,2000-05-30 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,230.0,318,285.160391,47.295597,20.0,20.0,20.0,1.0,1.0,0.0,1.0,5.438079,0.625001,3.161183,2.995732,2.995732,2.995732,1.0,1.0,0.0,1.0,CR-5/30/2000,=,NR,CNR,CNR,CNR
107414,IB-080,IB,2000-07-12 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,318,285.160391,47.295597,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,0.625001,3.161183,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,CR-7/12/2000,<,NR,CNR,CNR,CNR
107415,IB-080,IB,2000-08-02 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,318,285.160391,47.295597,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,0.625001,3.161183,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,CR-8/2/2000,<,NR,CNR,CNR,CNR
107416,IB-080,IB,2000-08-10 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,318,285.160391,47.295597,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,0.625001,3.161183,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,CR-8/10/2000,<,NR,CNR,CNR,CNR


In [12]:
# Test
assert 'ib' in locals(), "Didn't find a variable named 'ib' "
assert type(ib) == pd.DataFrame, "The 'ib' variable isn't a dataframe"
assert len(ib) == 5065, "The dataframe 'ib' is the wrong length. It should be 5065 records long "
assert all(ib.stationcode == 'IB-080'), "The 'ib' data frame doesn't have only the IB-080 station code"
assert all(_.stationcode == 'IB-080'), 'The final output of the cell should be the ib dataframe'
HTML('<p style="color:green; font-size:30pt">Success!</p>')

# Task 7: Convert the sampledate column to a real datetime object

* Use [pd.to_datetime()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)
* Convert the value for all records in the `df` DataFrame, not in the `ib` subset
* Assign the converted value to a column named 'date' on the ``df`` dataframe
* Re-create the `ib` dataframe, as you did in Task 6, so that it now has the converted dates.

To solve this task, you will need to use the pd.to_datetime() function and a [boolean indexer](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing).
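
The conversion can be sketched on a small, made-up dataframe (named `toy` here for illustration). The date strings follow the format seen in the `sampledate` column.

```python
import pandas as pd

# Hypothetical rows with dates stored as text.
toy = pd.DataFrame({"sampledate": ["1999-05-26 00:00:00", "2000-03-21 00:00:00"]})

# Parse the strings and store the result in a new column.
toy["date"] = pd.to_datetime(toy["sampledate"])

# Datetime columns expose date parts through the .dt accessor.
print(toy["date"].dt.year.tolist())
```

Adding the column to the full dataframe first and filtering afterwards means every subset created later carries the new column.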


In [39]:
# Solution

df['date'] = pd.to_datetime(df.sampledate)
ib = df[df.stationcode == 'IB-080']


In [14]:
# Test
assert 'date' in list(df.columns), "The 'df' DataFrame does not have a 'date' column"
assert str(df.date.dtype) == 'datetime64[ns]'
assert 'date' in list(ib.columns), "The 'ib' DataFrame does not have a 'date' column"
assert type(ib) == pd.DataFrame, "The 'ib' variable isn't a dataframe"
assert len(ib) == 5065, "The dataframe 'ib' is the wrong length. It should be 5065 records long "
assert all(ib.stationcode == 'IB-080'), "The 'ib' data frame doesn't have only the IB-080 station code"
HTML('<p style="color:green; font-size:30pt">Success!</p>')

# Task 8: Select rows with multiple conditions

* Select a subset of rows from `df` that have:
  * Station code IB-080
  * Analyte of Enterococcus
  * Measured using the Enterolert test
  * Measured in units of MPN/100 mL
* From the subset, select only the columns `date` and `result`
* Name the resulting dataframe `entr`
* Display the head of the dataframe

Remember that if [you select rows with more than one boolean indexer](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing), you have to combine them with the bitwise AND operator `&`, and each boolean clause must have parentheses around it. So, for instance:

```
     df[ (df.stationcode == 'IB-080') & (df.measure_code == 3)]
```

This statement is equivalent to:

```
stc_bool = (df.stationcode == 'IB-080')  # Series with TRUE for every row where the station code is IB-080
nc_bool =  (df.measure_code == 3)        # Series with TRUE for every row where the measure code is 3

index_bool = stc_bool & nc_bool          # Series with TRUE for every row where stc_bool and nc_bool are both TRUE

df[index_bool]  # select rows from DF where index_bool is TRUE
```
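
As a further sketch on a small, made-up dataframe (`toy`, `rows`, and `picked` are hypothetical names), the combined mask can also be passed to `.loc` together with a list of column names. This selects rows and columns in one step.

```python
import pandas as pd

# Hypothetical data; only rows 0 and 3 satisfy both conditions.
toy = pd.DataFrame({
    "stationcode": ["IB-080", "IB-080", "EH-010", "IB-080"],
    "measure_code": [3, 0, 3, 3],
    "result": [10.0, 20.0, 31.0, 42.0],
})

# Each clause is parenthesized because & binds more tightly than ==.
rows = (toy["stationcode"] == "IB-080") & (toy["measure_code"] == 3)

picked = toy.loc[rows, ["stationcode", "result"]]  # rows by mask, columns by name
print(picked)
```

Filtering first and then selecting columns with `df[...][['col1', 'col2']]` gives the same rows and columns. The `.loc` form is one alternative, not a requirement of the task.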




In [15]:
# Solution

# Task 8:
entr = df[(df.stationcode == 'IB-080') & (df.methodname == 'Enterolert') & (df.analyte == 'Enterococcus') & (df.unit == 'MPN/100 mL')] 
entr = entr[['date','result']]
entr.head()

Unnamed: 0,date,result
110949,2000-04-21,10.0
110950,2001-01-14,10.0
110951,2001-02-16,42.0
110952,2001-02-26,87.0
110953,2001-03-01,31.0


In [16]:
#Test
assert 'entr' in locals(), "Didn't find a variable named 'entr' "
assert _.result.sum() == 180, f'The sum of the result column of the head of displayed table should be 180. Got {_.result.sum()}'
HTML('<p style="color:green; font-size:30pt">Success!</p>')

In [41]:
df.loc

Unnamed: 0,stationcode,stationgroup,sampledate,collectiontime,measure_code,analyte,methodname,unit,result,result_group_count,result_group_std,result_group_mean,result_group_25pctl,result_group_median,result_group_75pctl,result_gt_median,result_gt_mean,result_lte_25pctl,result_gte_75pctl,lresult,lresult_group_std,lresult_group_mean,lresult_group_25pctl,lresult_group_median,lresult_group_75pctl,lresult_gt_lmedian,lresult_gt_lmean,lresult_lte_25pctl,lresult_gte_75pctl,labbatch,resultqualcode,qacode,sampleagency,labagency,submittingagency,date,count
67,EH-010,EH,2001-03-05 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,3000.0,213,1647.469843,376.478873,20.0,20.0,20.0,1.0,1.0,0.0,1.0,8.006368,1.439884,3.633361,2.995732,2.995732,2.995732,1.0,1.0,0.0,1.0,SH-3/5/2001,=,NR,SDCDEH,SDCDEH,SDCDEH,2001-03-05,1
68,EH-010,EH,2001-04-18 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-4/18/2001,<,NR,SDCDEH,SDCDEH,SDCDEH,2001-04-18,1
69,EH-010,EH,2001-05-02 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-5/2/2001,<,NR,SDCDEH,SDCDEH,SDCDEH,2001-05-02,1
70,EH-010,EH,2001-05-16 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-5/16/2001,<,NR,SDCDEH,SDCDEH,SDCDEH,2001-05-16,1
71,EH-010,EH,2001-05-23 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-5/23/2001,<,NR,SDCDEH,SDCDEH,SDCDEH,2001-05-23,1
72,EH-010,EH,2001-06-28 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-6/28/2001,<,NR,SDCDEH,SDCDEH,SDCDEH,2001-06-28,1
73,EH-010,EH,2001-08-01 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-8/1/2001,<,NR,SDCDEH,SDCDEH,SDCDEH,2001-08-01,1
74,EH-010,EH,2001-08-22 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-8/22/2001,<,NR,SDCDEH,SDCDEH,SDCDEH,2001-08-22,1
75,EH-010,EH,2001-08-29 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-8/29/2001,<,NR,SDCDEH,SDCDEH,SDCDEH,2001-08-29,1
76,EH-010,EH,2001-09-10 00:00:00,00:00:00,0,"Coliform, Fecal",MTF,MPN/100 mL,20.0,213,1647.469843,376.478873,20.0,20.0,20.0,0.0,0.0,0.0,1.0,2.995732,1.439884,3.633361,2.995732,2.995732,2.995732,0.0,0.0,0.0,1.0,SH-9/10/2001,<,NR,SDCDEH,SDCDEH,SDCDEH,2001-09-10,1
