### Test Scores

Create a series of 10 elements, random integers from 70-100, representing scores on a monthly exam. Set the index to be the month names, starting in September and ending in June.

In [1]:
import pandas as pd
import numpy as np
np.random.seed(0)
import math

In [2]:
months = "Sep Oct Nov Dec Jan Feb Mar Apr May Jun".split()


In [3]:
monthly_exam_scores = pd.Series(np.random.randint(70, 100, 10), index = months)
monthly_exam_scores

Sep    82
Oct    85
Nov    91
Dec    70
Jan    73
Feb    97
Mar    73
Apr    77
May    79
Jun    89
dtype: int64

In [4]:
# What is the student’s average score for the year?
student_avg_score = f'Student Yearly Average: {monthly_exam_scores.mean()}'

In [5]:
student_avg_score

'Student Yearly Average: 81.6'

In [6]:
# What is the student’s average test score during the first 
# half of the year (i.e., the first five months)?

student_first_half_avg = monthly_exam_scores["Sep" : "Jan"].mean()
student_first_half_avg 

80.2

In [7]:
# What is the student’s average test score 
# during the second half of the year?
student_second_half_avg = monthly_exam_scores["Feb" : "Jun"].mean()
student_second_half_avg

83.0

In [8]:
# Did the student improve their performance in the second half? 
# If so, then by how much?

performance_improvement = student_second_half_avg - student_first_half_avg
performance_improvement = f"Improvement: {performance_improvement}"
performance_improvement

'Improvement: 2.799999999999997'

#### Beyond the exercise
Retrieving both individual elements and slices from a series is a critical skill when working with pandas. Here are three additional exercises to help you understand them better.

In [9]:
# In which month did this student get their highest score? 
# Note that there are at least two ways to accomplish this: 
# You can sort the values, taking the largest one, or you 
# can use a boolean ("mask") index to find those rows that 
# match the value of s.max(), the highest value.

# month_with_highest_score = monthly_exam_scores.max()

month_with_highest_score = monthly_exam_scores[
    monthly_exam_scores == monthly_exam_scores.max()]
month_with_highest_score


Feb    97
dtype: int64
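Besides the mask and sorting approaches mentioned in the comment, pandas also offers `idxmax`, which returns the index label of the largest value directly. The sketch below rebuilds the series from the scores printed above; the names `scores`, `best_month`, and `best_by_sort` are only illustrative.

```python
import pandas as pd

months = "Sep Oct Nov Dec Jan Feb Mar Apr May Jun".split()
# Same scores as the series printed above
scores = pd.Series([82, 85, 91, 70, 73, 97, 73, 77, 79, 89], index=months)

# idxmax returns the index label of the (first) largest value
best_month = scores.idxmax()

# Sorting approach: the last label after an ascending sort
best_by_sort = scores.sort_values().index[-1]

print(best_month, scores[best_month])
```

Note that `idxmax` returns only the first label when several months tie for the highest score, whereas the mask returns all of them.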

In [10]:
# What were this student’s five highest scores in the year?

first_five_high_scores = monthly_exam_scores.sort_values(ascending=False).head(5)
first_five_high_scores


Feb    97
Nov    91
Jun    89
Oct    85
Sep    82
dtype: int64

In [11]:
monthly_exam_scores

Sep    82
Oct    85
Nov    91
Dec    70
Jan    73
Feb    97
Mar    73
Apr    77
May    79
Jun    89
dtype: int64

In [12]:
# Round the student’s scores to the nearest 10. 
# So a score of 82 would be rounded down to 80, 
# but a score of 87 would be rounded up to 90.
monthly_exam_scores_copy = monthly_exam_scores.copy()
def custom_round(x, base=10):
    # Python's round() sends ties to the nearest even number,
    # which is why 85 becomes 80 rather than 90 below
    return int(base * round(float(x)/base))
monthly_exam_scores_copy.apply(custom_round)

Sep     80
Oct     80
Nov     90
Dec     70
Jan     70
Feb    100
Mar     70
Apr     80
May     80
Jun     90
dtype: int64
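As an alternative to `apply`, a series can be rounded in one vectorized call by passing a negative number of decimals to `round`. This is a minimal sketch with hypothetical scores chosen to avoid ties; it also shows the tie behavior of Python's built-in `round`, which explains why 85 was rounded to 80 above.

```python
import pandas as pd

# Hypothetical scores with no values ending in 5 (no ties)
scores = pd.Series([82, 87, 73, 91])

# decimals=-1 rounds to the nearest multiple of 10
rounded = scores.round(-1)

# Python's built-in round() sends ties to the nearest even number
tie = round(8.5)

print(rounded.tolist(), tie)
```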

In [13]:
monthly_exam_scores.describe()

count    10.000000
mean     81.600000
std       8.834277
min      70.000000
25%      74.000000
50%      80.500000
75%      88.000000
max      97.000000
dtype: float64

### Standard Deviation
The standard deviation is a measure of how much the values in our data set vary from one another. 
In a data set with a standard deviation of 0, the values are all identical to one
another. By contrast, a data set with a very large standard deviation will
have values that vary greatly from the mean value.
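To make the definition concrete, the following sketch computes the standard deviation by hand and compares it with `Series.std`. The scores are hypothetical. By default pandas computes the sample standard deviation, dividing by n - 1 rather than n.

```python
import math
import pandas as pd

values = pd.Series([70, 80, 90])   # hypothetical scores
mean = values.mean()

# Sample standard deviation: sum of squared deviations divided by n - 1
manual_std = math.sqrt(((values - mean) ** 2).sum() / (len(values) - 1))

# Identical values do not vary at all, so their standard deviation is 0
identical_std = pd.Series([5, 5, 5]).std()

print(manual_std, values.std(), identical_std)
```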

### Exercise 2: Scaling Test scores

In [14]:
s = pd.Series("10 20 30".split())
s.dtype
s

0    10
1    20
2    30
dtype: object

In [15]:
s = s.astype(np.int64)
s

0    10
1    20
2    30
dtype: int64

In [16]:
s + (80 - s.mean())

0    70.0
1    80.0
2    90.0
dtype: float64

In [17]:
## Example of vectorized operations
s1 = pd.Series([10, 20, 30, 40])
s2 = pd.Series([100, 200, 300, 400])

s1 + s2

0    110
1    220
2    330
3    440
dtype: int64

In [18]:
## Vectorized operations match elements by index label, not by position
s1 = pd.Series([10, 20, 30, 40], index=list("abcd"))
s2 = pd.Series([100, 200, 300, 400], index=list("dcba"))
s1 + s2


a    410
b    320
c    230
d    140
dtype: int64
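The example above works cleanly because both series share the same labels. As a sketch of what happens when they do not, the snippet below adds two hypothetical series whose indexes only partly overlap. Labels present on one side only produce `NaN`, unless `Series.add` is given a `fill_value`.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=list("abc"))
s2 = pd.Series([100, 200, 300], index=list("bcd"))

# "a" and "d" each exist in only one series, so their sums are NaN
total = s1 + s2

# Treat a missing label as 0 instead of producing NaN
filled = s1.add(s2, fill_value=0)

print(total)
print(filled)
```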

In [19]:
# When you add a scalar value to a series, pandas applies the
# operation to every element. This is called broadcasting.

s = pd.Series([10, 20, 30, 40], index=list("abcd"))

s + 3

a    13
b    23
c    33
d    43
dtype: int64

In [20]:
# Generate test scores from 40 up to (but not including) 60
np.random.seed(0)

months = "Sep Oct Nov Dec Jan Feb Mar Apr May Jun".split()
s = pd.Series(np.random.randint(40, 60, 10), 
             index =months)
s

Sep    52
Oct    55
Nov    40
Dec    43
Jan    43
Feb    47
Mar    49
Apr    59
May    58
Jun    44
dtype: int64

In [21]:
# Then add 10 points to them 
s + 10

Sep    62
Oct    65
Nov    50
Dec    53
Jan    53
Feb    57
Mar    59
Apr    69
May    68
Jun    54
dtype: int64

In [22]:
s + (80 - s.mean())

Sep    83.0
Oct    86.0
Nov    71.0
Dec    74.0
Jan    74.0
Feb    78.0
Mar    80.0
Apr    90.0
May    89.0
Jun    75.0
dtype: float64

#### Note
Whenever we perform an operation on an int and a float, we get back a float,
even if there's no need for it, as with addition.
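A short sketch of this dtype behavior, using a hypothetical three-element series: adding an int keeps the integer dtype, adding a float switches to `float64`, and `astype` converts back when the fractional part is not needed.

```python
import pandas as pd

s = pd.Series([10, 20, 30], dtype="int64")

int_result = s + 1       # int + int stays int64
float_result = s + 1.0   # int + float becomes float64

# Round first, then convert back to integers if floats are not wanted
back_to_int = float_result.round().astype("int64")

print(int_result.dtype, float_result.dtype, back_to_int.dtype)
```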

#### Beyond the Exercise:
Implementing the other way to scale test scores, by looking at both the 
mean of the scores and their standard deviation.

In [23]:
s

Sep    52
Oct    55
Nov    40
Dec    43
Jan    43
Feb    47
Mar    49
Apr    59
May    58
Jun    44
dtype: int64

In [24]:
s_mean = s.mean()
s_std = s.std()
s_mean + s_std

55.733003292241385

In [25]:
s_std

6.733003292241386

In [26]:
s_mean

49.0

In [27]:
# An A student scores more than mean + std
s[s > (s_mean + s_std)]

Apr    59
May    58
dtype: int64

In [28]:
# A B student scores above the mean but less than mean + std
s[(s < s_mean + s_std) & (s > s_mean)]

Sep    52
Oct    55
dtype: int64

In [29]:
# A C student scores below the mean but greater than mean - std
s[(s > s_mean - s_std) & (s < s_mean)]

Dec    43
Jan    43
Feb    47
Jun    44
dtype: int64

In [30]:
# A D student scores less than mean - std
s[s < s_mean - s_std]

Nov    40
dtype: int64
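The four masks above use strict inequalities, so a score exactly equal to the mean (Mar, with 49) lands in none of the groups. One possible way to assign every score a grade is `pd.cut`, sketched below with the same scores. The bin edges and the choice to put a score equal to an edge into the lower grade are assumptions for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd

months = "Sep Oct Nov Dec Jan Feb Mar Apr May Jun".split()
scores = pd.Series([52, 55, 40, 43, 43, 47, 49, 59, 58, 44], index=months)

mean, std = scores.mean(), scores.std()

# Bin edges: (-inf, mean-std], (mean-std, mean], (mean, mean+std], (mean+std, inf)
bins = [-np.inf, mean - std, mean, mean + std, np.inf]
grades = pd.cut(scores, bins=bins, labels=["D", "C", "B", "A"])

print(grades)
```

Because each interval is closed on the right by default, the score of 49 falls into the C bin here.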

### Beyond 2
Were there any test scores more than 2 standard deviations above or below the mean?

In [31]:
s[(s < s_mean - 2*s_std) | (s > s_mean + 2*s_std)]

Series([], dtype: int64)

### Beyond 3:
How close are the mean and median to one another? 
What does it mean if they are close?

In [32]:
s_mean = s.mean()
s_mean 

49.0

In [33]:
s_median = s.median()
s_median

48.0

The mean and median are basically the same, which means that we don't have any large outliers skewing the mean's value. If the mean were much higher than the median, then we would assume we have at least one very high test score. And if the mean were much lower than the median, we could assume we have at least one very low test score.
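The effect of an outlier on the mean, but not on the median, can be seen in a small sketch. Both series below are hypothetical; the second differs from the first only in its largest value.

```python
import pandas as pd

typical = pd.Series([48, 49, 50, 51, 52])
with_outlier = pd.Series([48, 49, 50, 51, 500])

# The single very high score drags the mean up; the median stays put
print(typical.mean(), typical.median())
print(with_outlier.mean(), with_outlier.median())
```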

## Exercise 3: Counting 10s digits
In this exercise, we want to generate 10 random integers in the range 0 - 100 and then extract the tens digit of each.

In [34]:
np.random.seed(0)
s = pd.Series(np.random.randint(0, 100, 10))
s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [35]:
# Solution 1
(s/10).astype(np.int8)

0    4
1    4
2    6
3    6
4    6
5    0
6    8
7    2
8    3
9    8
dtype: int8

In [36]:
# solution 2 partial
s.astype(str).str.get(-2).fillna("0")

0    4
1    4
2    6
3    6
4    6
5    0
6    8
7    2
8    3
9    8
dtype: object

In [37]:
# Solution 2 Complete
s.astype(str).str.get(-2).fillna("0").astype(np.int8)

0    4
1    4
2    6
3    6
4    6
5    0
6    8
7    2
8    3
9    8
dtype: int8
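A third option, sketched here as an alternative to the two solutions above, uses integer arithmetic only: floor division by 10 drops the ones digit, and the remainder modulo 10 keeps just the tens digit. The first series repeats the values shown earlier; the second is hypothetical and shows that the same expression handles larger numbers and single digits.

```python
import pandas as pd

s = pd.Series([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])

# Drop the ones digit, then keep only the last remaining digit
tens = s // 10 % 10

# The same expression works for larger values and for values below 10
larger = pd.Series([2732, 9845, 7]) // 10 % 10

print(tens.tolist(), larger.tolist())
```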

## Beyond the exercise


In [38]:
np.random.seed(0)
s = pd.Series(np.random.randint(0, 10000, 10))
s

0    2732
1    9845
2    3264
3    4859
4    9225
5    7891
6    4373
7    5874
8    6744
9    3468
dtype: int64

### Beyond 1
What if the range were from 0 - 10,000?
How would that change your strategy, if at all?

In [39]:
# Our string method will work just fine here. We can drop the call to
# fillna if we are sure none of the values would be < 10.
s.astype(str).str.get(-2).fillna("0").astype(np.int8)

0    3
1    4
2    6
3    5
4    2
5    9
6    7
7    7
8    4
9    6
dtype: int8

## Beyond 2
What is the smallest dtype we should use for our integers?

In [40]:
# Let's find the minimum and maximum values of our series
(s.min(), s.max())

(2732, 9845)

In [41]:
s

0    2732
1    9845
2    3264
3    4859
4    9225
5    7891
6    4373
7    5874
8    6744
9    3468
dtype: int64

In [42]:
# What happens if we use int8
s.astype(np.int8)

0    -84
1    117
2    -64
3     -5
4      9
5    -45
6     21
7    -14
8     88
9   -116
dtype: int8

In [43]:
# What happens when we use uint8
s.astype(np.uint8)

0    172
1    117
2    192
3    251
4      9
5    211
6     21
7    242
8     88
9    140
dtype: uint8

In [44]:
# What happens when we use uint16
s.astype(np.uint16)

0    2732
1    9845
2    3264
3    4859
4    9225
5    7891
6    4373
7    5874
8    6744
9    3468
dtype: uint16

In [45]:
# Checking what happens when we use int16
s.astype(np.int16)

0    2732
1    9845
2    3264
3    4859
4    9225
5    7891
6    4373
7    5874
8    6744
9    3468
dtype: int16

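Given a range from 0 - 10000, the smallest dtypes we can use without any problems are uint16 and int16. Both are 16 bits wide: int16 holds values up to 32,767 and uint16 up to 65,535, whereas int8 and uint8 top out at 127 and 255, which is why the values wrapped around in the outputs above.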
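The limits of each integer dtype can be looked up with `np.iinfo` instead of being found by trial and error. The helper below, `smallest_int_dtype`, is a hypothetical function written for this sketch; it simply returns the first dtype in a fixed list whose range covers the given minimum and maximum. The last line also reproduces the wraparound seen earlier: 2732 became 172 as a uint8 because only the remainder modulo 256 is kept.

```python
import numpy as np

def smallest_int_dtype(lo, hi):
    """Return the first candidate dtype whose range covers lo..hi."""
    candidates = (np.int8, np.uint8, np.int16, np.uint16,
                  np.int32, np.uint32, np.int64)
    for candidate in candidates:
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return np.dtype(candidate)
    raise ValueError("no candidate integer dtype covers this range")

for_scores = smallest_int_dtype(2732, 9845)   # min and max of the series above
for_larger = smallest_int_dtype(0, 40_000)    # too large for int16
wrapped = 2732 % 256                          # what uint8 keeps of 2732

print(for_scores, for_larger, wrapped)
```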

## Beyond 3
Create a new series, with 10 floating-point values between 0 and 1,000. Find the numbers whose integer component (i.e., ignoring any fractional part) are even.

In [46]:

s = pd.Series(np.random.rand(10) * 1000)
s

0    383.441519
1    791.725038
2    528.894920
3    568.044561
4    925.596638
5     71.036058
6     87.129300
7     20.218397
8    832.619846
9    778.156751
dtype: float64

In [47]:
# Take the integer part of each number and get its remainder when divided by 2.
# Check which results are 0, and use that as a mask index on s.

s[s.astype(np.int16) % 2 == 0]

2    528.894920
3    568.044561
7     20.218397
8    832.619846
9    778.156751
dtype: float64

# SELECTING VALUES WITH BOOLEANS
In Python and other traditional programming languages, we can select elements from a sequence using a combination of for loops and if statements. While you could do that in pandas, you almost certainly don’t want to. Instead, you want to select items using a combination of techniques known as a "boolean index" or a "mask index."

Mask indexes are useful and powerful, but their syntax can take some getting used to.

First, consider that you can retrieve any element of a series via square brackets and an index:

In [48]:
s = pd.Series([10, 20, 30, 40, 50])
s[3]

40

In [49]:
# Instead of passing a single integer, we can pass a list 
# (or numpy array or series) of boolean values

s = pd.Series([10, 20, 30, 40, 50])
s[[True, True, False, False, True]]

0    10
1    20
4    50
dtype: int64

In [50]:
# Using a comparison operator (e.g., ==)
s[s<30]

0    10
1    20
dtype: int64

In [51]:
# Getting more sophisticated
s[s <= s.mean()]

0    10
1    20
2    30
dtype: int64

In [52]:
s.mean()

30.0

In [53]:
# We can use a mask for assignment
s[s <= s.mean()] = 999
s

0    999
1    999
2    999
3     40
4     50
dtype: int64

The technique above is worth learning and internalizing, 
because it is both powerful and efficient.
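Masks can also be combined. The sketch below uses `&` (and), `|` (or), and `~` (not) on a hypothetical series. Each comparison needs its own parentheses, because these operators bind more tightly than `<`, `>`, and `==` in Python.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

middle = s[(s > 10) & (s < 50)]                  # both conditions hold
extremes = s[(s == s.min()) | (s == s.max())]    # either condition holds
not_small = s[~(s < 30)]                         # condition does not hold

print(middle.tolist(), extremes.tolist(), not_small.tolist())
```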

# 1.5 Exercise 4: Descriptive Statistics
The mean, median, and standard deviation are three numbers we can use to get a better picture of our data. But there are some other numbers that we can use to fully understand it. These are collectively known as "descriptive statistics."



In [54]:
# Generate a series of 100,000 floats in a normal distribution
# with mean at 0
# Standard deviation of 100
np.random.seed(0)

s = pd.Series(np.random.normal(0, 100, 100_000))
s1 = s.copy()

In [55]:
# Get the descriptive statistics for this series
s.describe()

count    100000.000000
mean          0.157670
std          99.734467
min        -485.211765
25%         -66.864170
50%           0.172022
75%          67.343870
max         424.177191
dtype: float64

In [56]:
# Replace the minimum value with 5 times the maximum value

s[s == s.min()] = 5*s.max()

In [57]:
s.describe()

count    100000.000000
mean          0.183731
std          99.947900
min        -465.995297
25%         -66.862839
50%           0.174214
75%          67.345174
max        2120.885956
dtype: float64

In [58]:
s.median()

0.17421399102941376

## 1.5.3 Beyond the Exercise

## Beyond 1
Demonstrate that 68%, 95%, and 99.7% of the values in s are indeed within 1, 2, and 3 standard deviations of the mean.

In [59]:
s1.describe()

count    100000.000000
mean          0.157670
std          99.734467
min        -485.211765
25%         -66.864170
50%           0.172022
75%          67.343870
max         424.177191
dtype: float64

In [60]:
# within 1 standard deviation

s[(s > s.mean() - s.std()) &
 (s < s.mean() + s.std())].count()/100_000

0.68567

In [61]:
# within 2 standard deviations
s[(s > s.mean() - 2*s.std()) &
 (s < s.mean() + 2*s.std())].count()/100_000

0.95432

In [62]:
# within 3 standard deviations
s[(s > s.mean() - 3*s.std()) &
 (s < s.mean() + 3*s.std())].count()/100_000

0.99717
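The three checks above repeat the same expression with a different multiplier, so they can be folded into one small function. This is a sketch under stated assumptions: `share_within` is a hypothetical helper, and the data comes from NumPy's newer `default_rng` generator, so the exact fractions will differ slightly from the outputs above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(0, 100, 100_000))

def share_within(series, k):
    """Fraction of values closer than k standard deviations to the mean."""
    distance = (series - series.mean()).abs()
    # The mean of a boolean series is the fraction of True values
    return (distance < k * series.std()).mean()

shares = {k: share_within(s, k) for k in (1, 2, 3)}
print(shares)
```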

## Beyond 2
Calculate the mean of numbers greater than s.mean(). Then calculate the mean of numbers less than s.mean(). Is the average of these two numbers the same as s.mean()?

In [63]:
(s1[s1 < s1.mean()].mean() + s1[s1 > s1.mean()].mean() ) / 2

0.15290179343802635
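The result is close to, but not exactly, the overall mean because a plain average of the two group means ignores how many values each group contains. A tiny hypothetical series makes the difference obvious: weighting each group mean by its size recovers the overall mean, provided no value is exactly equal to the mean and so excluded from both groups.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 10])   # hypothetical, deliberately lopsided

below = s[s < s.mean()]
above = s[s > s.mean()]

# Plain average of the two group means
simple_avg = (below.mean() + above.mean()) / 2

# Average weighted by the number of values in each group
weighted_avg = (below.mean() * len(below) + above.mean() * len(above)) / (
    len(below) + len(above)
)

print(s.mean(), simple_avg, weighted_avg)
```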

In [64]:
s1.describe()

count    100000.000000
mean          0.157670
std          99.734467
min        -485.211765
25%         -66.864170
50%           0.172022
75%          67.343870
max         424.177191
dtype: float64

## Beyond 3
What is the mean of the numbers more than 3 standard deviations from the mean?

In [65]:
s1[(s1 < s1.mean() - 3*s1.std()) |
 (s1 > s1.mean() + 3*s1.std())].mean()

-2.3577910861855775

# 1.6 Exercise 5: Monday Temperatures

It’s common to assume that the index in a pandas series is unique. After all, the index in a Python string, list, or tuple is unique, as are the keys in a Python dictionary. But it turns out that a series index can contain repeated values. This turns out to be quite useful in many ways.

In this exercise, I want you to create a series of 28 temperature readings in Celsius, representing four seven-day weeks, randomly selected from a normal distribution with a mean of 20 and a standard deviation of 5, rounded to the nearest integer. (If you’re in a country that measures temperature in Fahrenheit, then just pretend you’re looking at the weather in an exotic foreign location, rather than where you live.) The index should start with Sun, continue through Sat, and then repeat Sun through Sat until each temperature has a value.

The question is: What was the mean temperature on Mondays during this period?

In [66]:
np.random.seed(0)
np.random.normal(20, 5, 28)

array([28.82026173, 22.00078604, 24.89368992, 31.204466  , 29.33778995,
       15.1136106 , 24.75044209, 19.24321396, 19.48390574, 22.05299251,
       20.72021786, 27.27136753, 23.80518863, 20.60837508, 22.21931616,
       21.66837164, 27.47039537, 18.97420868, 21.56533851, 15.7295213 ,
        7.23505092, 23.26809298, 24.32218099, 16.2891749 , 31.34877312,
       12.72817163, 20.22879259, 19.06408075])

In [67]:
# create seven element list of strings, with the days of the week:

days = "Sun Mon Tue Wed Thu Fri Sat".split()
days

['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']

In [68]:
np.random.seed(0)
s = pd.Series(np.random.normal(20, 5, 28),
             index=days*4).round().astype(np.int8)

In [69]:
s

Sun    29
Mon    22
Tue    25
Wed    31
Thu    29
Fri    15
Sat    25
Sun    19
Mon    19
Tue    22
Wed    21
Thu    27
Fri    24
Sat    21
Sun    22
Mon    22
Tue    27
Wed    19
Thu    22
Fri    16
Sat     7
Sun    23
Mon    24
Tue    16
Wed    31
Thu    13
Fri    20
Sat    19
dtype: int8

In [70]:
s.loc["Mon"].mean()

21.75
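Because the index labels repeat, the mean for every day of the week can be computed in one step by grouping on the index. This is a sketch of one alternative to selecting a single label with `loc`; it rebuilds the series from the rounded temperatures printed above.

```python
import pandas as pd

days = "Sun Mon Tue Wed Thu Fri Sat".split()
# The 28 rounded temperatures printed above, four weeks of Sun through Sat
temps = pd.Series(
    [29, 22, 25, 31, 29, 15, 25,
     19, 19, 22, 21, 27, 24, 21,
     22, 22, 27, 19, 22, 16, 7,
     23, 24, 16, 31, 13, 20, 19],
    index=days * 4,
)

# level=0 groups rows that share the same index label
by_day = temps.groupby(level=0).mean()

print(by_day)
```

The groups come back sorted alphabetically by label, so `by_day["Mon"]` is the safest way to pick out a single day.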

### 1.6.3 Beyond the exercise 

#### Beyond 1
What was the average temperature on the weekends (i.e., Saturdays and Sundays)?

In [71]:
# Temperature on Sat and Sun
s.loc[["Sat", "Sun"]]

Sat    25
Sat    21
Sat     7
Sat    19
Sun    29
Sun    19
Sun    22
Sun    23
dtype: int8

In [72]:
# Average of temperature on weekends
s.loc[["Sat", "Sun"]].mean()

20.625

#### Beyond 2
How many times was the change in temperature from the previous day greater than 2 degrees?

In [73]:
# By default .diff compares with the previous element

s[s.diff() > 2]

Tue    25
Wed    31
Sat    25
Tue    22
Thu    27
Tue    27
Thu    22
Sun    23
Wed    31
Fri    20
dtype: int8

In [74]:
# How many times?
s[s.diff() > 2].count()

10
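Note that `s.diff() > 2` is true only when the temperature rose by more than 2 degrees. If drops should count as well, one option is to take the absolute value of the differences first. The sketch below uses a short hypothetical series so that both counts are easy to check by hand.

```python
import pandas as pd

temps = pd.Series([20, 24, 21, 22, 18], index="Sun Mon Tue Wed Thu".split())

# Differences from the previous day: NaN, +4, -3, +1, -4
change = temps.diff()

rises = (change > 2).sum()                 # increases of more than 2 degrees
any_direction = (change.abs() > 2).sum()   # rises or drops of more than 2 degrees

print(rises, any_direction)
```

Summing a boolean series counts its `True` values, which avoids building the filtered series just to call `count` on it.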

#### Beyond 3
What are the two most common temperatures in our data set, and how often does each appear?


In [75]:
# Two most common temperatures

# value_counts returns a series in which the values from s are 
# the index, the number of appearances is the value, and the
# items are ordered from most common to least common. We can
# then use "head" to get only the 2 most common values.

s.value_counts().head(2)

22    5
19    4
dtype: int64

# FANCY INDEXING


### 1.7 Exercise 6 Passenger Frequency
In this exercise, we’re going to start to look at some real-world data. We’ll be looking at reading from and writing to data in greater depth starting in chapter 3, but we’re going to start here by reading from a file into a series. This is possible with the workhorse pd.read_csv function, which normally returns a data frame but, in older versions of pandas, can be coerced into returning a series from a file with the squeeze parameter set to True. (pandas 2.0 removed that parameter; there, call .squeeze("columns") on the returned data frame instead.) This only works if each line of the file contains a single value, which makes it a CSV file without any commas in it.

The data we’ll look at is in the file taxi-passenger-count.csv, available along with the other data files used in this course. The data comes from 2015 data I retrieved from New York City’s open data site, from which you can get enormous amounts of information about taxi rides in New York City over the last few years. This file shows the number of passengers in each of 9,999 rides.

Your task in this exercise is to show what percentage of taxi rides had only 1 passenger, vs. the maximum of 6 passengers.

In [78]:
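# The squeeze parameter of read_csv was removed in pandas 2.0; calling
# .squeeze("columns") on the one-column data frame gives the same series.
s = pd.read_csv("../data/taxi-passenger-count.csv", header=None).squeeze("columns")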
s

0       1
1       1
2       1
3       1
4       1
       ..
9994    1
9995    1
9996    1
9997    6
9998    1
Name: 0, Length: 9999, dtype: int64

In [81]:
s[s==1].count()
s[s==6].count()

369

In [82]:
# A far easier way: value_counts()
s.value_counts()

1    7207
2    1313
5     520
3     406
6     369
4     182
0       2
Name: 0, dtype: int64

In [83]:
# Frequency of 1- and 6-passenger rides

s.value_counts()[[1,6]]

1    7207
6     369
Name: 0, dtype: int64

In [85]:
# The normalize parameter returns proportions (fractions of the total) rather than raw counts
s.value_counts(normalize=True)[[1,6]]

1    0.720772
6    0.036904
Name: 0, dtype: float64

# 1.7.3 Beyond the exercise
Let’s analyze our taxi passenger data in a few more ways:

What are the 25%, 50% (median), and 75% quantiles for this data set? Can you guess the results before you execute the code?
What proportion of taxi rides are for 3, 4, 5, or 6 passengers?
Consider that you’re in charge of vehicle licensing for New York taxis. Given these numbers, would more people benefit from smaller taxis that can take only one or two passengers, or larger taxis that can take five or six passengers?

#### Beyond 1
What are the 25%, 50% (median), and 75% quantiles for this data set? Can you guess the results before you execute the code?

In [91]:
# The 25%

s.describe()["25%"]

1.0

In [89]:
# The 50%
s.describe()["50%"]

1.0

In [90]:
# The 75%
s.describe()["75%"]

2.0

In [92]:
# 25%, 50%, 75%
s.describe()[["25%", "50%", "75%"]]

25%    1.0
50%    1.0
75%    2.0
Name: 0, dtype: float64
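The same quantiles can be requested directly with `Series.quantile`, without computing the full `describe` output. Since the CSV file is not available here, this sketch rebuilds an equivalent series from the `value_counts` output shown earlier; the order of the rides is lost, but quantiles do not depend on order.

```python
import numpy as np
import pandas as pd

# Passenger counts and their frequencies, copied from value_counts() above
counts = {1: 7207, 2: 1313, 5: 520, 3: 406, 6: 369, 4: 182, 0: 2}
rides = pd.Series(np.repeat(list(counts.keys()), list(counts.values())))

# Ask for the three quantiles in a single call
quartiles = rides.quantile([0.25, 0.50, 0.75])

# Share of rides carrying one or two passengers
small_share = rides.isin([1, 2]).mean()

print(quartiles)
print(round(small_share, 3))
```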

#### Beyond 2
What proportion of taxi rides are for 3, 4, 5, or 6 passengers?

In [97]:
s.value_counts(normalize=True)[[3,4,5,6]].sum()

0.1477147714771477

#### Beyond 3
Consider that you’re in charge of vehicle licensing for New York taxis. Given these numbers, would more people benefit from smaller taxis that can take only one or two passengers, or larger taxis that can take five or six passengers?

In [96]:
s.value_counts()

1    7207
2    1313
5     520
3     406
6     369
4     182
0       2
Name: 0, dtype: int64

Given that the vast majority of rides (about 85%) carry only 1 or 2 passengers, licensing more small taxis would benefit more people than licensing more large ones.