<img src="https://www.mca.org.uk/wp-content/uploads/sites/60/2020/01/Carnall-Farrar-Inspiring-change.png" style="float: left; margin: 20px; height: 55px">

# Hands on with Pandas

---

### Learning Objectives

**After this section, you will be able to:**
- Load data from Excel or CSV files into a pandas DataFrame
- Use pandas methods for descriptive statistics
- Filter, sort, and rename data
- Handle missing data
- Export a pandas DataFrame to Excel


In [1]:
# Load Pandas into Python
import pandas as pd

<a id="reading-files"></a>
### Reading Files, Selecting Columns, and Summarizing

In [24]:
df = pd.read_csv('~/Downloads/referrals_oct19_dec20.csv')

**Examine the referrals data.**

In [25]:
type(df)             # check its type

pandas.core.frame.DataFrame

In [26]:
df                   # Display the DataFrame; large tables are truncated to the first and last rows.

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals
0,2019-10-07,00L,(blank),Routine,13
1,2019-10-07,00L,(blank),Urgent,1
2,2019-10-07,00L,2WW,2 Week Wait,349
3,2019-10-07,00L,Allergy,Routine,3
4,2019-10-07,00L,Cardiology,Routine,84
...,...,...,...,...,...
592679,2020-12-21,99M,Surgery - Not Otherwise Specified,Urgent,2
592680,2020-12-21,99M,Surgery - Vascular,Routine,2
592681,2020-12-21,99M,Surgery - Vascular,Urgent,2
592682,2020-12-21,99M,Urology,Routine,25


In [27]:
df.head()            # Print the first five rows.

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals
0,2019-10-07,00L,(blank),Routine,13
1,2019-10-07,00L,(blank),Urgent,1
2,2019-10-07,00L,2WW,2 Week Wait,349
3,2019-10-07,00L,Allergy,Routine,3
4,2019-10-07,00L,Cardiology,Routine,84


In [28]:
df.head(10)          # Print the first 10 rows.

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals
0,2019-10-07,00L,(blank),Routine,13
1,2019-10-07,00L,(blank),Urgent,1
2,2019-10-07,00L,2WW,2 Week Wait,349
3,2019-10-07,00L,Allergy,Routine,3
4,2019-10-07,00L,Cardiology,Routine,84
5,2019-10-07,00L,Cardiology,Urgent,38
6,2019-10-07,00L,Children's & Adolescent Services,Routine,112
7,2019-10-07,00L,Children's & Adolescent Services,Urgent,15
8,2019-10-07,00L,Dermatology,Routine,81
9,2019-10-07,00L,Dermatology,Urgent,9


In [9]:
df.tail()            # Print the last five rows.

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals
592679,2020-12-21,99M,Surgery - Not Otherwise Specified,Urgent,2
592680,2020-12-21,99M,Surgery - Vascular,Routine,2
592681,2020-12-21,99M,Surgery - Vascular,Urgent,2
592682,2020-12-21,99M,Urology,Routine,25
592683,2020-12-21,99M,Urology,Urgent,11


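`sample` returns a random selection of rows, which is useful for spotting values that `head` and `tail` miss. Chaining `.transpose()` flips rows and columns so that each sampled record reads top to bottom.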

In [35]:
df.sample(5).transpose()

Unnamed: 0,77087,237066,268112,81225,245596
week_start,2019-11-18,2020-02-24,2020-03-16,2019-11-25,2020-03-02
ccg_code,36J,70F,11J,01A,11J
specialty,Dermatology,Neurology,"Ear, Nose & Throat",Geriatric Medicine,Genetics
priority,Urgent,Routine,Urgent,Urgent,Urgent
referrals,8,45,20,2,1
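
Because `sample` draws rows at random, each run returns different rows. A minimal sketch of making the draw repeatable with the `random_state` parameter, using a small invented table (`toy`) rather than the referrals file:

```python
import pandas as pd

# Invented rows, only for illustration.
toy = pd.DataFrame({
    "specialty": ["Urology", "Neurology", "Cardiology", "Allergy"],
    "referrals": [25, 45, 84, 3],
})

# The same random_state gives the same rows on every call.
first_draw = toy.sample(2, random_state=42)
second_draw = toy.sample(2, random_state=42)

print(first_draw.equals(second_draw))  # True
```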


In [30]:
# The row index (aka "the row labels" — in this case integers)
df.index            

RangeIndex(start=0, stop=592684, step=1)

In [31]:
# Column names (also stored as an Index)
df.columns

Index(['week_start', 'ccg_code', 'specialty', 'priority', 'referrals'], dtype='object')

In [12]:
# Datatypes of each column — each column is stored as an 
# ndarray, which has a datatype
df.dtypes

week_start    object
ccg_code      object
specialty     object
priority      object
referrals      int64
dtype: object
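
The `object` dtype for `week_start` means the dates are stored as plain strings. A minimal sketch of converting them to real datetimes with `pd.to_datetime`, assuming the `YYYY-MM-DD` layout seen in the output above and using a few invented rows (`sample_rows`) instead of the full file:

```python
import pandas as pd

# Invented rows in the same layout as the referrals data.
sample_rows = pd.DataFrame({
    "week_start": ["2019-10-07", "2019-10-14", "2020-12-21"],
    "ccg_code": ["00L", "00L", "99M"],
    "referrals": [13, 1, 25],
})

# Parse the text dates; the format string matches YYYY-MM-DD.
sample_rows["week_start"] = pd.to_datetime(sample_rows["week_start"], format="%Y-%m-%d")

print(sample_rows.dtypes)

# Datetime columns expose date parts through the .dt accessor.
years = sample_rows["week_start"].dt.year.tolist()
print(years)  # [2019, 2019, 2020]
```

`pd.read_csv` also accepts a `parse_dates` argument, which performs the same conversion while the file is being read.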

In [13]:
# Number of rows and columns
df.shape

(592684, 5)

In [14]:
# All values as a NumPy array
df.values

array([['2019-10-07', '00L', '(blank)', 'Routine', 13],
       ['2019-10-07', '00L', '(blank)', 'Urgent', 1],
       ['2019-10-07', '00L', '2WW', '2 Week Wait', 349],
       ...,
       ['2020-12-21', '99M', 'Surgery - Vascular', 'Urgent', 2],
       ['2020-12-21', '99M', 'Urology', 'Routine', 25],
       ['2020-12-21', '99M', 'Urology', 'Urgent', 11]], dtype=object)

In [33]:
# Concise summary (including memory usage) — 
# useful to quickly see if nulls exist
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 592684 entries, 0 to 592683
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   week_start  592684 non-null  object
 1   ccg_code    592684 non-null  object
 2   specialty   592684 non-null  object
 3   priority    592684 non-null  object
 4   referrals   592684 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 22.6+ MB


**Select or index data.**<br>
Pandas `DataFrame`s have structural similarities with Python-style lists and dictionaries.  
In the example below, we select a column of data using the name of the column in a similar manner to how we select a dictionary value with the dictionary key.


In [16]:
# Select a column
df['priority']

0             Routine
1              Urgent
2         2 Week Wait
3             Routine
4             Routine
             ...     
592679         Urgent
592680        Routine
592681         Urgent
592682        Routine
592683         Urgent
Name: priority, Length: 592684, dtype: object

In [17]:
type(df['week_start'])

pandas.core.series.Series
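
Selecting with a single column name returns a `Series`, as shown above. A short sketch, on an invented table (`toy`), of how passing a list of names instead returns a `DataFrame`:

```python
import pandas as pd

# Invented rows, only for illustration.
toy = pd.DataFrame({
    "week_start": ["2019-10-07", "2019-10-07"],
    "priority": ["Routine", "Urgent"],
    "referrals": [13, 1],
})

one_column = toy["priority"]                   # single name -> Series
two_columns = toy[["priority", "referrals"]]   # list of names -> DataFrame

print(type(one_column).__name__)    # Series
print(type(two_columns).__name__)   # DataFrame
```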

**Summarize (describe) the data.**<br>
Pandas provides built-in methods that summarize your data and give you a quick overview of it.

In [19]:
# Describe all numeric columns.
df.describe()

Unnamed: 0,referrals
count,592684.0
mean,28.039053
std,72.963456
min,1.0
25%,2.0
50%,7.0
75%,23.0
max,1985.0


In [20]:
# Describe all columns, including non-numeric.
df.describe(include='all')

Unnamed: 0,week_start,ccg_code,specialty,priority,referrals
count,592684,592684,592684,592684,592684.0
unique,64,135,57,3,
top,2019-10-14,91Q,Cardiology,Routine,
freq,11631,13496,22472,331289,
mean,,,,,28.039053
std,,,,,72.963456
min,,,,,1.0
25%,,,,,2.0
50%,,,,,7.0
75%,,,,,23.0


In [21]:
# Describe a single column — recall that df["referrals"] 
# refers to a Series.
df["referrals"].describe()

count    592684.000000
mean         28.039053
std          72.963456
min           1.000000
25%           2.000000
50%           7.000000
75%          23.000000
max        1985.000000
Name: referrals, dtype: float64

In [37]:
# Calculate the mean number of referrals.
df["referrals"].mean()

28.03905285109772

**Count the number of occurrences of each value.**

In [23]:
df["specialty"].value_counts()     # Most useful for categorical variables

Cardiology                              22472
GI and Liver (Medicine and Surgery)     20704
Children's & Adolescent Services        20060
Gynaecology                             20008
Ear, Nose & Throat                      19987
Urology                                 19864
Dermatology                             19822
Orthopaedics                            19768
Neurology                               19720
Ophthalmology                           19510
Rheumatology                            19474
Respiratory Medicine                    18750
Endocrinology and Metabolic Medicine    18506
Surgery - Vascular                      18455
Haematology                             18255
Surgery - Not Otherwise Specified       17692
Surgery - Breast                        16987
Nephrology                              15973
Pain Management                         15526
Oral and Maxillofacial Surgery          14629
Diabetic Medicine                       14245
Neurosurgery                      
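
`value_counts` can also report proportions instead of raw counts. A minimal sketch using the `normalize=True` parameter on an invented Series (`priority`):

```python
import pandas as pd

# Invented values, only for illustration.
priority = pd.Series(["Routine", "Routine", "Urgent", "2 Week Wait"])

# normalize=True divides each count by the total number of values.
shares = priority.value_counts(normalize=True)
print(shares)
```

Here "Routine" appears in two of the four rows, so its share is 0.5.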

<a id="filtering-and-sorting"></a>
### Filtering and Sorting
- **Objective:** Filter and sort data using Pandas.

We can use simple operator comparisons on columns to extract relevant or drop irrelevant information.
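
The filtering and sorting cells in this section refer to a `users` DataFrame with `age`, `gender`, and `occupation` columns, which is not loaded anywhere above. A hypothetical stand-in, with entirely invented values, that lets those cells run:

```python
import pandas as pd

# Hypothetical stand-in for the users data; every value here is invented.
users = pd.DataFrame({
    "age": [17, 24, 19, 65, 42],
    "gender": ["M", "F", "F", "M", "M"],
    "occupation": ["student", "doctor", "student", "retired", "lawyer"],
})

print(users.head())
```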

**Logical filtering: Only show users with age < 20.**

In [None]:
# Create a Series of Booleans…
# In Pandas, this comparison is performed element-wise 
# on each row of data.
young_bool = users["age"] < 20
young_bool

In [None]:
# …and use that Series to filter rows.
# In Pandas, indexing a DataFrame by a Series of Booleans 
# only selects rows that are True in the Boolean.
users[young_bool]

In [None]:
# Or, combine into a single step.
users[users["age"] < 20]

In [None]:
# Important: Filtering with a Boolean Series returns a new DataFrame 
# (a copy), not a view of the original.
# If you store the result in a variable and then modify it, 
# you change only the filtered copy and not 
# the original DataFrame.
# Here, Pandas may give you a SettingWithCopyWarning 
# to alert you that `users` itself is left unchanged.

# It is best practice to use .loc and .iloc instead of the syntax below.

users_under20 = users[users["age"] < 20]   
# To resolve this warning, copy the `DataFrame` using `.copy()`.
users_under20['is_under_20'] = True

In [None]:
users.head()

In [None]:
users_under20.head()
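
A minimal sketch of the `.copy()` fix mentioned in the comment above, using an invented `users` table. Taking an explicit copy makes it clear that the new column belongs only to the filtered DataFrame:

```python
import pandas as pd

# Invented values, only for illustration.
users = pd.DataFrame({"age": [17, 24, 19, 65]})

# .copy() makes the filtered result an independent DataFrame.
users_under20 = users[users["age"] < 20].copy()
users_under20["is_under_20"] = True

# The original DataFrame does not gain the new column.
print("is_under_20" in users.columns)  # False
```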

To create the `is_under_20` column in the original DataFrame, we can use `.loc`.

The syntax is:

`my_dataframe.loc[<filter_condition>, <column>] = <new_value>`

In [None]:
users.loc[users["age"] < 20, "is_under_20"] = True
users.head()

In [None]:
users.loc[users["age"] >= 20, "is_under_20"] = False
users.head()
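
The two `.loc` assignments above fill the column in two passes. When the new column is simply the result of a comparison, it can also be assigned in one step; a sketch on an invented `users` table:

```python
import pandas as pd

# Invented values, only for illustration.
users = pd.DataFrame({"age": [17, 24, 19, 65]})

# The comparison already yields True/False for every row.
users["is_under_20"] = users["age"] < 20

print(users)
```

This form also gives the column a proper `bool` dtype, whereas filling it in two passes leaves it as `object`.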

`.loc` is also useful if you want to filter **both** rows and columns at the same time.

In [None]:
# Select one column from the filtered results.
users.loc[users["is_under_20"], "occupation"]

`.iloc` works in the same way as `.loc`, but selects rows and columns by integer position instead of by label.
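
A short sketch contrasting position-based `.iloc` with label-based `.loc`. The table and its string row labels are invented so that positions and labels are easy to tell apart:

```python
import pandas as pd

# Invented values; string row labels make labels distinct from positions.
users = pd.DataFrame(
    {"age": [17, 24, 19], "occupation": ["student", "doctor", "student"]},
    index=["a", "b", "c"],
)

first_row = users.iloc[0]          # first row, by position
top_left = users.iloc[0, 0]        # first row, first column
first_two_rows = users.iloc[:2]    # positions 0 and 1 (end is excluded)
by_label = users.loc["a", "age"]   # same cell as top_left, by label

print(top_left, by_label)
```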

**Logical filtering with multiple conditions**

In [None]:
# Ampersand for `AND` condition. (This is a "bitwise" `AND`.)
# Important: You MUST put parentheses around each expression 
# because `&` has a higher precedence than `<`.

users[(users["is_under_20"]) & (users["gender"] == 'M')]

In [None]:
# Pipe for `OR` condition. (This is a "bitwise" `OR`.)
# Important: You MUST put parentheses around each expression 
# because `|` has a higher precedence than `<`.

users[(users["is_under_20"]) | (users["age"] > 60)]

In [None]:
# Preferred alternative to multiple `OR` conditions
users[users["occupation"].isin(['doctor', 'lawyer'])]
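
Alongside `&` and `|`, the `~` operator negates a Boolean Series. A minimal sketch, on an invented `users` table, that keeps the rows whose occupation is not in the list:

```python
import pandas as pd

# Invented values, only for illustration.
users = pd.DataFrame({
    "age": [17, 24, 65, 42],
    "occupation": ["student", "doctor", "retired", "lawyer"],
})

# ~ flips True and False, so this keeps everyone who is NOT a doctor or lawyer.
others = users[~users["occupation"].isin(["doctor", "lawyer"])]

print(others)
```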

**Sorting**

In [None]:
# Sort a Series.
users["age"].sort_values()

In [None]:
# Sort a DataFrame by a single column.
users.sort_values('age')

In [None]:
# Use descending order instead.
users.sort_values('age', ascending=False)

In [None]:
# Sort by multiple columns.
users.sort_values(['occupation', 'age'])
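
When sorting by several columns, `ascending` can take a list with one entry per column. A sketch on an invented `users` table that sorts occupations from A to Z and, within each occupation, ages from oldest to youngest:

```python
import pandas as pd

# Invented values, only for illustration.
users = pd.DataFrame({
    "occupation": ["student", "doctor", "student", "doctor"],
    "age": [17, 24, 19, 30],
})

# One ascending flag per sort column: occupation ascending, age descending.
ordered = users.sort_values(["occupation", "age"], ascending=[True, False])

print(ordered)
```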

<a id="columns"></a>
### Renaming, Adding, and Removing Columns

- **Objective:** Manipulate `DataFrame` columns.

In [None]:
# Read drinks.csv into a DataFrame called "drinks".
drinks = pd.read_csv('data/drinks.csv')

In [None]:
drinks.head()

In [None]:
# Rename one or more columns using a mapping of old names to new names.
# This returns a new DataFrame and leaves "drinks" unchanged.
drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'})

In [None]:
# Rename one or more columns in the original DataFrame.
drinks.rename(columns={'beer_servings':'beer', 
                       'wine_servings':'wine'}, inplace=True)

drinks.head()

In [None]:
# Replace all column names using a list of matching length
drink_cols = ['country', 'beer', 'spirit', 'wine', 'litres', 'continent'] 

# Replace during file reading (header=0 discards the file's own header row)
drinks_renamed = pd.read_csv('data/drinks.csv', header=0, 
                             names=drink_cols)
drinks_renamed.head()

In [None]:
# Replace after file has already been read into Python.
drinks.columns = drink_cols

drinks.head()

**Easy Column Operations**<br>
Column arithmetic in Pandas is vectorized: when we add columns together, the values are added row by row, so there is no need to reference indexes or write `for` loops.

In [None]:
# Add a new column as a function of existing columns.
drinks['servings'] = drinks["beer"] + drinks["spirit"] + drinks["wine"]
drinks['mL'] = drinks["litres"] * 1000

drinks.head()
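
To make the row-by-row behaviour concrete, the sketch below compares the column addition with an explicit loop. The table is invented and uses the renamed `beer`, `spirit`, and `wine` columns:

```python
import pandas as pd

# Invented values, only for illustration.
drinks = pd.DataFrame({
    "beer": [0, 89, 25],
    "spirit": [0, 132, 0],
    "wine": [0, 54, 14],
})

# Vectorized: one expression adds the three columns row by row.
vectorized = drinks["beer"] + drinks["spirit"] + drinks["wine"]

# Equivalent explicit loop over the rows, shown only for comparison.
looped = [row.beer + row.spirit + row.wine for row in drinks.itertuples()]

print(vectorized.tolist())  # [0, 275, 39]
print(looped)               # [0, 275, 39]
```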

**Removing Columns**

In [None]:
# axis=0 for rows, 1 for columns.
# drop returns a new DataFrame; "drinks" itself is unchanged.
drinks.drop('mL', axis=1)

In [None]:
drinks.head()

In [None]:
# Drop on the original DataFrame rather than returning a new one.
drinks.drop('mL', axis=1, inplace=True)

drinks.head()

In [None]:
# Drop multiple columns.
drinks.drop(['beer', 'servings'], axis=1)
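
As an alternative to `axis=1`, `drop` accepts a `columns` keyword that states the intent directly. A minimal sketch on an invented `drinks` table:

```python
import pandas as pd

# Invented values, only for illustration.
drinks = pd.DataFrame({
    "country": ["A", "B"],
    "beer": [89, 25],
    "wine": [54, 14],
    "servings": [143, 39],
})

# columns=[...] is equivalent to passing the list with axis=1.
trimmed = drinks.drop(columns=["beer", "servings"])

print(list(trimmed.columns))  # ['country', 'wine']
```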

### Handling Missing Data

In [None]:
# Missing values are usually excluded in calculations by default.
drinks["continent"].value_counts()              
# Excludes missing values in the calculation

In [None]:
# Includes missing values
drinks["continent"].value_counts(dropna=False)

In [None]:
# Find missing values in a Series.
# True if missing, False if not missing
drinks["continent"].isnull()

In [None]:
# Count the missing values — sum() works because True is 1 and False is 0.
drinks["continent"].isnull().sum()

In [None]:
# Only show rows where continent is not missing.
drinks[drinks["continent"].notnull()]

**Understanding Pandas Axis**

In [None]:
# Sums "down" the 0 axis (rows) — so, we get the sums of each column
drinks.sum(axis=0)

In [None]:
# axis=0 is the default.
drinks.sum()

In [None]:
# Sums "across" the 1 axis (columns), 
# so we get the sum of the numeric values in each row 
# (beer + spirit + wine + litres + …).
# numeric_only=True skips text columns such as country.
drinks.sum(axis=1, numeric_only=True)
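
A small, purely numeric sketch of the two axes, using an invented `scores` table so the totals are easy to check by hand:

```python
import pandas as pd

# Invented values, only for illustration.
scores = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 collapses the rows: one total per column.
column_totals = scores.sum(axis=0)

# axis=1 collapses the columns: one total per row.
row_totals = scores.sum(axis=1)

print(column_totals.tolist())  # [6, 60]
print(row_totals.tolist())     # [11, 22, 33]
```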

**Find missing values in a `DataFrame`.**

In [None]:
# Count the missing values in each column — remember by default, axis=0.
print(drinks.isnull().sum())

**Dropping Missing Values**

In [None]:
# Drop a row if ANY values are missing from any column — can be dangerous!
drinks.dropna()

In [None]:
# Drop a row only if ALL values are missing.
drinks.dropna(how='all')
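
Between dropping on any missing value and dropping only fully empty rows, `dropna` offers two finer controls: `subset` limits the check to chosen columns, and `thresh` keeps rows with at least that many non-missing values. A sketch on an invented table:

```python
import pandas as pd

# Invented values; None marks a missing entry.
drinks = pd.DataFrame({
    "country": ["A", "B", "C"],
    "beer": [1.0, None, 3.0],
    "continent": ["EU", None, None],
})

# Drop a row only if "continent" is missing.
has_continent = drinks.dropna(subset=["continent"])

# Keep rows that have at least 2 non-missing values.
mostly_complete = drinks.dropna(thresh=2)

print(has_continent.index.tolist())    # [0]
print(mostly_complete.index.tolist())  # [0, 2]
```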

**Filling Missing Values**<br>
You may have noticed that the continent North America (NA) does not appear in the `continent` column. When Pandas read the original data, it treated the string "NA" as a missing-value marker and converted it to `NaN`.

In [None]:
# Fill in missing values with "NA" — 
# this is dangerous to do without manually verifying them!
drinks["continent"].fillna(value='NA')

In [None]:
# Modifies "drinks" by assigning the filled column back
drinks["continent"] = drinks["continent"].fillna(value='NA')

In [None]:
# Turn off the missing value filter so that "NA" is read as text.
# Note that this disables detection of ALL missing values in the file.
drink_cols = ['country', 'beer', 'spirit', 'wine', 'litres', 'continent']
drinks = pd.read_csv('data/drinks.csv', header=0, 
                     names=drink_cols, na_filter=False)

In [None]:
drinks["continent"].value_counts()   
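
Because `na_filter=False` switches off all missing-value detection, genuinely empty cells are no longer flagged either. A possible middle ground, sketched here on a small invented CSV held in memory, is to drop the default marker list with `keep_default_na=False` and name only the empty string in `na_values`:

```python
import io

import pandas as pd

# Invented CSV text; "NA" is a real continent code and one beer value is empty.
csv_text = "country,beer,continent\nCanada,240,NA\nFrance,127,EU\nNowhere,,EU\n"

# Default behaviour: the string "NA" is read as a missing value.
default = pd.read_csv(io.StringIO(csv_text))

# Keep "NA" as text, but still treat empty cells as missing.
strict = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(default["continent"].isnull().sum())  # 1
print(strict["continent"].tolist())         # ['NA', 'EU', 'EU']
print(strict["beer"].isnull().sum())        # 1
```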