In [2]:
import numpy as np
import pandas as pd

**Problem 1**

In [33]:
np.random.seed(0)

months = 'Sep Oct Nov Dec Jan Feb Mar Apr May Jun'.split()

s = pd.Series(np.random.randint(70, 100, 10),
              index=months)
s

Sep    82
Oct    85
Nov    91
Dec    70
Jan    73
Feb    97
Mar    73
Apr    77
May    79
Jun    89
dtype: int32

In [26]:
print(f'Entire year average: {s.mean()}')

Entire year average: 85.1


In [27]:
first_half_average = s['Sep':'Jan'].mean()
second_half_average = s['Feb':'Jun'].mean()

print(f'Yearly average: {s.mean()}')
print(f'First half average: {first_half_average}')
print(f'Second half average: {second_half_average}')
print(f'Improvement: {second_half_average - first_half_average}')

Yearly average: 85.1
First half average: 79.6
Second half average: 90.6
Improvement: 11.0
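Note that a label-based slice such as `s['Sep':'Jan']` includes its endpoint, unlike a positional slice. The following is a minimal sketch of the difference, with the scores displayed above hard-coded so that it runs on its own; the variable names are illustrative only.

```python
import pandas as pd

months = 'Sep Oct Nov Dec Jan Feb Mar Apr May Jun'.split()
scores = pd.Series([82, 85, 91, 70, 73, 97, 73, 77, 79, 89], index=months)

# Label-based slicing includes both endpoints: Sep, Oct, Nov, Dec, Jan
by_label = scores['Sep':'Jan']

# Position-based slicing excludes the stop position: Sep, Oct, Nov, Dec
by_position = scores.iloc[0:4]

print(len(by_label), len(by_position))
```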


# Beyond 1

In which month did this student get their highest score? Note that there are at least two ways to accomplish this: You can sort the values, taking the largest one, or you can use a boolean ("mask") index to find those rows that match the value of `s.max()`, the highest value.

In [35]:
# Option 1
s.sort_values(ascending=False).index[0]

'Feb'

In [36]:
# Option 2
s[s==s.max()].index[0]

'Feb'
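A third option, not mentioned in the exercise text, is `idxmax`, which returns the index label of the first occurrence of the largest value. A short sketch with the scores above hard-coded:

```python
import pandas as pd

months = 'Sep Oct Nov Dec Jan Feb Mar Apr May Jun'.split()
scores = pd.Series([82, 85, 91, 70, 73, 97, 73, 77, 79, 89], index=months)

# idxmax returns the label (here, the month) of the highest score
best_month = scores.idxmax()
print(best_month)
```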

# Beyond 2

What were this student's five highest scores in the year?

In [37]:
s.sort_values(ascending=False).head(5)

Feb    97
Nov    91
Jun    89
Oct    85
Sep    82
dtype: int32
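The same result can probably be had more directly with `nlargest`, which avoids sorting the whole series just to keep the top few rows. A sketch with the scores above hard-coded:

```python
import pandas as pd

months = 'Sep Oct Nov Dec Jan Feb Mar Apr May Jun'.split()
scores = pd.Series([82, 85, 91, 70, 73, 97, 73, 77, 79, 89], index=months)

# nlargest(5) returns the five highest values, largest first
top_five = scores.nlargest(5)
print(top_five)
```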

# Beyond 3

Round the student's scores to the nearest 10.  So a score of 82 would be rounded down to 80, but a score of 87 would be rounded up to 90.

In [38]:
# The "round" method, when given a positive integer argument, rounds numbers after the
# decimal point. When given a negative integer argument, it rounds numbers *before* the decimal point!

s.round(-1)  

Sep     80
Oct     80
Nov     90
Dec     70
Jan     70
Feb    100
Mar     70
Apr     80
May     80
Jun     90
dtype: int32
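One subtlety in the output above: 85 became 80, not 90. NumPy, which pandas relies on here, rounds a value lying exactly halfway between two candidates to the even one ("round half to even"). A small sketch with made-up float values:

```python
import pandas as pd

# Each value sits exactly halfway between two multiples of 10
halfway = pd.Series([75.0, 85.0, 95.0])

# 7.5 -> 8, 8.5 -> 8, 9.5 -> 10 (ties go to the even neighbor), then scaled by 10
rounded = halfway.round(-1)
print(rounded.tolist())
```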

**Problem 2**

In [41]:
np.random.seed(0)

months = 'Sep Oct Nov Dec Jan Feb Mar Apr May Jun'.split()

s = pd.Series(np.random.randint(40, 60, 10),
          index=months)
s

Sep    52
Oct    55
Nov    40
Dec    43
Jan    43
Feb    47
Mar    49
Apr    59
May    58
Jun    44
dtype: int32

In [42]:
s + (80 - s.mean())

Sep    83.0
Oct    86.0
Nov    71.0
Dec    74.0
Jan    74.0
Feb    78.0
Mar    80.0
Apr    90.0
May    89.0
Jun    75.0
dtype: float64

# Beyond 1

There's at least one other way to scale test scores, namely by looking at both the mean of the scores and their standard deviation. We can say anyone who scored within 1 standard deviation of the mean got a C (below the mean) or a B (above the mean). Anyone who scored more than 1 standard deviation above the mean got an A, and anyone who got more than one standard deviation below the mean got a D. During which months did our student get an A, B, C, and D?

In [43]:
# A students are greater than mean + std
s[s > s.mean() + s.std()]

Apr    59
May    58
dtype: int32

In [44]:
# B students are greater than mean, but less than mean+std
s[(s < s.mean() + s.std()) & (s > s.mean())]

Sep    52
Oct    55
dtype: int32

In [45]:
# C students are less than mean, but greater than mean-std
s[(s > s.mean() - s.std()) & (s < s.mean())]

Dec    43
Jan    43
Feb    47
Jun    44
dtype: int32

In [46]:
# D students are less than mean - std
s[s < s.mean() - s.std()]

Nov    40
dtype: int32
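The four masks above use strict inequalities, so a score exactly equal to the mean (Mar, 49) appears in none of them. One way to give every month exactly one grade is `pd.cut`. This is only a sketch: placing a score equal to the mean in the C bin is an assumption, since the exercise does not say where such a score belongs.

```python
import numpy as np
import pandas as pd

months = 'Sep Oct Nov Dec Jan Feb Mar Apr May Jun'.split()
scores = pd.Series([52, 55, 40, 43, 43, 47, 49, 59, 58, 44], index=months)

mean, std = scores.mean(), scores.std()

# Bins are open on the left and closed on the right, so a score
# exactly equal to the mean lands in the C bin.
grades = pd.cut(scores,
                bins=[-np.inf, mean - std, mean, mean + std, np.inf],
                labels=['D', 'C', 'B', 'A'])
print(grades)
```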

# Beyond 2

Were there any test scores more than 2 standard deviations above or below the mean?  If so, in which months?

In [47]:
# Were any test scores mean+2 standard deviations, or mean - 2 standard deviations?
s[(s < s.mean()-2*s.std())  |
  (s > s.mean()+2*s.std())]

# nope, turns out there weren't any!

Series([], dtype: int32)

# Beyond 3

How close are the mean and median to one another? What does it mean if they are close? What would it mean if they are far apart?

In [48]:
s.mean()

49.0

In [49]:
s.median()

48.0

The mean and median are basically the same, which means that we don't have any large outliers skewing the mean's value. If the mean were much higher than the median, then we would assume we have at least one very high test score. And if the mean were much lower than the median, we could assume we have at least one very low test score.
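To see this effect in action, the sketch below replaces one score with a made-up, implausibly high value (a hypothetical data-entry error) and compares the two statistics before and after.

```python
import pandas as pd

months = 'Sep Oct Nov Dec Jan Feb Mar Apr May Jun'.split()
scores = pd.Series([52, 55, 40, 43, 43, 47, 49, 59, 58, 44], index=months)

# Hypothetical outlier: April's 59 mistyped as 590
with_outlier = scores.copy()
with_outlier['Apr'] = 590

# The mean jumps, while the median stays where it was
print(scores.mean(), scores.median())
print(with_outlier.mean(), with_outlier.median())
```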

**Problem 3**

In [52]:
np.random.seed(0)

s = pd.Series(np.random.randint(0, 100, 10))
s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int32

In [53]:
# solution 1
(s / 10).astype(np.int8)

0    4
1    4
2    6
3    6
4    6
5    0
6    8
7    2
8    3
9    8
dtype: int8

In [54]:
# solution 2, partial
s.astype(str).str.get(-2).fillna('0')

0    4
1    4
2    6
3    6
4    6
5    0
6    8
7    2
8    3
9    8
dtype: object

In [55]:
# solution 2, complete
s.astype(str).str.get(-2).fillna('0').astype(np.int8)

0    4
1    4
2    6
3    6
4    6
5    0
6    8
7    2
8    3
9    8
dtype: int8

In [57]:
np.random.seed(0)

s = pd.Series(np.random.randint(0, 10000, 10))
s

0    2732
1    9845
2    3264
3    4859
4    9225
5    7891
6    4373
7    5874
8    6744
9    3468
dtype: int32

# Beyond 1

What if the range were from 0 - 10,000? How would that change your strategy, if at all?

In [58]:
# Our string strategy will work just fine here! If none of the numbers
# are <10, then we can even remove the call to "fillna", but I think that 
# it's wiser to keep that around.

s.astype(str).str.get(-2).fillna('0').astype(np.int8)


0    3
1    4
2    6
3    5
4    2
5    9
6    7
7    7
8    4
9    6
dtype: int8
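An arithmetic alternative avoids the round trip through strings: integer-divide by 10 to drop the ones digit, then take the remainder modulo 10. It also handles numbers below 10 without `fillna`. A sketch using the values shown above, plus a small made-up series:

```python
import pandas as pd

s = pd.Series([2732, 9845, 3264, 4859, 9225, 7891, 4373, 5874, 6744, 3468])

# Drop the ones digit, then keep only the last remaining digit
tens_digit = (s // 10) % 10
print(tens_digit.tolist())

# Single-digit numbers have no tens digit, so the result is 0
small = pd.Series([9, 44])
small_tens = (small // 10) % 10
print(small_tens.tolist())
```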

# Beyond 2

Given a range from 0 to 10,000, what's the smallest `dtype` we could use for our integers?

In [59]:
# Let's find the min and max values for our pd.Series:

print(s.min(), s.max())

2732 9845


In [60]:
# What happens if we use int8?
s.astype(np.int8)

0    -84
1    117
2    -64
3     -5
4      9
5    -45
6     21
7    -14
8     88
9   -116
dtype: int8

In [61]:
# What happens if we use uint8?
s.astype(np.uint8)

0    172
1    117
2    192
3    251
4      9
5    211
6     21
7    242
8     88
9    140
dtype: uint8

In [62]:
# So it seems we really need to use either np.int16 or np.uint16 to avoid problems!
s.astype(np.int16)

0    2732
1    9845
2    3264
3    4859
4    9225
5    7891
6    4373
7    5874
8    6744
9    3468
dtype: int16
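Rather than finding the right size by trial and error, we can ask NumPy for the range of each integer type with `np.iinfo`, or for the smallest type that holds a given value with `np.min_scalar_type`. The sketch below also shows where the odd 8-bit results come from: the cast keeps only the low 8 bits of each number.

```python
import numpy as np

# Print the representable range of each candidate dtype
for dtype in (np.int8, np.uint8, np.int16, np.uint16):
    info = np.iinfo(dtype)
    print(f'{dtype.__name__:>7}: {info.min} to {info.max}')

# 8-bit casts keep only the low 8 bits: 2732 % 256 is 172 as uint8,
# and reading 172 as a signed 8-bit value gives 172 - 256 = -84.
values = np.array([2732, 9845], dtype=np.int32)
as_uint8 = values.astype(np.uint8)
as_int8 = values.astype(np.int8)
print(as_uint8, as_int8)

# Smallest dtype able to hold the upper end of the range
smallest = np.min_scalar_type(10_000)
print(smallest)
```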

# Beyond 3

Create a new pd.Series, with 10 floating-point values between 0 and 1,000. Find the numbers whose integer component (i.e., ignoring any fractional part) is even.

In [63]:
# First, create the pd.Series
s = pd.Series(np.random.rand(10) * 1000)
s

0    383.441519
1    791.725038
2    528.894920
3    568.044561
4    925.596638
5     71.036058
6     87.129300
7     20.218397
8    832.619846
9    778.156751
dtype: float64

In [64]:
# Get the modulus (dividing by 2) of the int version of the numbers
# Check which results are 0, and use that as a mask index on s

s[s.astype(np.int64) % 2 == 0]

2    528.894920
3    568.044561
7     20.218397
8    832.619846
9    778.156751
dtype: float64

**Problem 4**

In [65]:
np.random.seed(0)

s = pd.Series(np.random.normal(0, 100, 100_000))
s

0        176.405235
1         40.015721
2         97.873798
3        224.089320
4        186.755799
            ...    
99995    -33.771476
99996   -202.854844
99997     72.618198
99998   -116.783052
99999   -128.520765
Length: 100000, dtype: float64

In [66]:
s.describe()

count    100000.000000
mean          0.157670
std          99.734467
min        -485.211765
25%         -66.864170
50%           0.172022
75%          67.343870
max         424.177191
dtype: float64

In [67]:
s.loc[s == s.min()] = 5*s.max()

In [68]:
s.describe()

count    100000.000000
mean          0.183731
std          99.947900
min        -465.995297
25%         -66.862839
50%           0.174214
75%          67.345174
max        2120.885956
dtype: float64

In [69]:
np.random.seed(0)

s = pd.Series(np.random.normal(0, 100, 100_000))
s

0        176.405235
1         40.015721
2         97.873798
3        224.089320
4        186.755799
            ...    
99995    -33.771476
99996   -202.854844
99997     72.618198
99998   -116.783052
99999   -128.520765
Length: 100000, dtype: float64

# Beyond 1

Demonstrate that 68%, 95%, and 99.7% of the values in `s` are indeed within 1, 2, and 3 standard deviations of the mean.

In [70]:
# within one standard deviation
s[(s > s.mean() - s.std()) &
  (s < s.mean() + s.std())].count() / 100_000

0.68463

In [71]:
# within two standard deviations
s[(s > s.mean() - 2*s.std()) &
  (s < s.mean() + 2*s.std())].count() / 100_000

0.95382

In [72]:
# within three standard deviations
s[(s > s.mean() - 3*s.std()) &
  (s < s.mean() + 3*s.std())].count() / 100_000

0.99714
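The three cells above repeat the same pattern, so it can be wrapped in a small helper. This is a sketch: `fraction_within` is a made-up name, and the data comes from `np.random.default_rng(0)`, a different generator from `np.random.seed(0)`, so the exact fractions will differ slightly from those above. The theoretical share for a normal distribution, `erf(k / sqrt(2))`, is printed for comparison.

```python
import math

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(0, 100, 100_000))

def fraction_within(series, k):
    """Return the share of values less than k standard deviations from the mean."""
    distance = (series - series.mean()).abs()
    # The mean of a boolean mask is the fraction of True values
    return (distance < k * series.std()).mean()

for k in (1, 2, 3):
    expected = math.erf(k / math.sqrt(2))
    print(k, fraction_within(s, k), round(expected, 4))
```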

# Beyond 2

Calculate the mean of numbers greater than `s.mean()`. Then calculate the mean of numbers less than `s.mean()`. Is the average of these two numbers the same as `s.mean()`?

In [73]:
(s[s < s.mean()].mean() + s[s > s.mean()].mean() ) / 2

0.1529017934377066

In [74]:
# They're pretty close!
s.mean()

0.15767005081253402
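The two numbers are close but not identical because the simple average gives the lower and upper groups equal weight, even though they need not contain the same number of values. Weighting each group's mean by its size recovers the overall mean, provided no value is exactly equal to the mean. A sketch on made-up, deliberately lopsided data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 40.0])   # made-up values, mean is 10
m = s.mean()

lower = s[s < m]    # four values
upper = s[s > m]    # one value

# Equal weights: far from the overall mean when the groups differ in size
simple = (lower.mean() + upper.mean()) / 2

# Weights proportional to group size: matches the overall mean
weighted = ((lower.mean() * len(lower) + upper.mean() * len(upper))
            / (len(lower) + len(upper)))

print(m, simple, weighted)
```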

# Beyond 3

What is the mean of the numbers beyond 3 standard deviations?

In [75]:
# A pretty complex combination of mask indexes,
# but the result is still a pd.Series, on which we can run mean()
s[(s < s.mean() - 3*s.std()) | 
  (s > s.mean() + 3*s.std()) ].mean()

-2.357791086185572

**Problem 5**

In [82]:
np.random.normal(20, 5, 28)

array([27.66389607, 27.34679385, 20.77473713, 21.8908126 , 15.56107126,
       10.09601766, 18.26043925, 20.78174485, 26.1514534 , 26.01189924,
       18.06336591, 18.48848625, 14.75723517, 12.89991031, 11.46864905,
       29.75387698, 17.45173909, 17.80962849, 13.7360232 , 23.88745178,
       11.93051076, 18.9362986 , 15.52266719, 21.93451249, 17.44597431,
       14.09683908, 19.85908886, 22.14165935])

In [83]:
days = 'Sun Mon Tue Wed Thu Fri Sat'.split()

np.random.seed(0)
s = pd.Series(np.random.normal(20, 5, 28),      # mean, standard deviation, array shape
          index=days*4).round().astype(np.int8)

In [84]:
s

Sun    29
Mon    22
Tue    25
Wed    31
Thu    29
Fri    15
Sat    25
Sun    19
Mon    19
Tue    22
Wed    21
Thu    27
Fri    24
Sat    21
Sun    22
Mon    22
Tue    27
Wed    19
Thu    22
Fri    16
Sat     7
Sun    23
Mon    24
Tue    16
Wed    31
Thu    13
Fri    20
Sat    19
dtype: int8

In [85]:
s.loc['Mon'].mean()

21.75

# Beyond 1

What was the average temperature on weekends (i.e., Saturdays and Sundays)?

In [88]:
s[['Sun', 'Sat']].mean()

20.625

# Beyond 2

How many times will the change in temperature from the previous day be greater than 2 degrees?

In [89]:
# by default, the "diff" method compares with the previous element
s[s.diff() > 2]

Tue    25
Wed    31
Sat    25
Tue    22
Thu    27
Tue    27
Thu    22
Sun    23
Wed    31
Fri    20
dtype: int8
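The mask above keeps only the days on which the temperature rose by more than 2 degrees, and it returns the matching rows rather than a count. If a drop of more than 2 degrees should count as well (an assumption about what the question intends), taking the absolute value of the differences and summing the boolean mask gives the number directly. A sketch with the temperatures above hard-coded:

```python
import pandas as pd

days = 'Sun Mon Tue Wed Thu Fri Sat'.split()
temps = pd.Series([29, 22, 25, 31, 29, 15, 25,
                   19, 19, 22, 21, 27, 24, 21,
                   22, 22, 27, 19, 22, 16, 7,
                   23, 24, 16, 31, 13, 20, 19],
                  index=days * 4)

# The first element has no previous day, so its diff is NaN and never counts
rises = (temps.diff() > 2).sum()            # increases only
changes = (temps.diff().abs() > 2).sum()    # increases and decreases

print(rises, changes)
```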

# Beyond 3

What are the two most common temperatures, and how often does each appear?

In [90]:
# value_counts returns a pd.Series in which the values from s are 
# the index, the number of appearances is the value, and the
# items are ordered from most common to least common. We can
# then use "head" to get only the 2 most common values.
s.value_counts().head(2)

22    5
19    4
dtype: int64

**Problem 6**

In [92]:
s = pd.read_csv('../data/taxi-passenger-count.csv', header=None).squeeze()

s.value_counts(normalize=True)[[1, 6]]

FileNotFoundError: [Errno 2] No such file or directory: '../data/taxi-passenger-count.csv'

In [2]:
s.value_counts()

1    7207
2    1313
5     520
3     406
6     369
4     182
0       2
Name: 0, dtype: int64

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

s = pd.read_csv('../data/taxi-passenger-count.csv', header=None).squeeze()

# Beyond 1

What are the 25%, 50% (median), and 75% quantiles for this data set? Can you guess the results before you execute the code?

In [2]:
# Since 1-passenger rides are 72% of the values, we can
# guess that the 25% and 50% marks will be 1, whereas 
# the 75% mark will be 2 or 3, depending on how common those are.
s.quantile([.25, .50, .75])

0.25    1.0
0.50    1.0
0.75    2.0
Name: 0, dtype: float64

# Beyond 2

What proportion of taxi rides are for 3, 4, 5, or 6 passengers?

In [3]:
s.value_counts(normalize=True)[[3,4,5,6]].sum()

0.1477147714771477

# Beyond 3

Consider that you're in charge of vehicle licensing for New York taxis. Given these numbers, would more people benefit from smaller taxis that can take only one or two passengers, or larger taxis that can take five or six passengers?

Given that a huge proportion of rides are for 1 or 2 passengers, licensing more small taxis would seem to match the needs.

**Problem 7**

In [8]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

s = pd.read_csv('../data/taxi-distance.csv', header=None).squeeze()

In [11]:
pd.cut(s, bins=[s.min()-1, 2, 10, s.max()], 
       labels=['short', 'medium', 'long']).value_counts()

short     5890
medium    3402
long       707
Name: 0, dtype: int64

# Beyond 1

Compare the mean and median trip distances. What does that tell you about the distribution of our data?

In [2]:
s.describe()

count    9999.000000
mean        3.158511
std         4.037516
min         0.000000
25%         1.000000
50%         1.700000
75%         3.300000
max        64.600000
Name: 0, dtype: float64

Because the mean (3.16) is significantly higher than the median (1.7), it would seem that there are some *very* long trips in our data set that are pulling the mean up. And sure enough, although the standard deviation is only about 4, the longest trip is 64.6 miles, roughly 15 standard deviations above the mean.

# Beyond 2

How many short, medium, and long trips were there for trips that had only one passenger? Note that data for passenger count and trip length are from the same data set, meaning that the indexes are the same.

In [3]:
passenger_count = pd.read_csv('../data/taxi-passenger-count.csv', header=None).squeeze()

pd.cut(s[passenger_count == 1], 
       bins=[s.min(), 2, 10, s.max()], 
       labels=['short', 'medium', 'long']).value_counts()

short     4285
medium    2387
long       487
Name: 0, dtype: int64
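The bins here start at `s.min()` rather than `s.min()-1` as in the first cell of this problem. Because `pd.cut` intervals are open on the left by default, a trip whose distance equals the lowest edge falls outside every bin and is dropped from the counts. That would explain why the three counts sum to 7,159 rather than the 7,207 single-passenger rides. Passing `include_lowest=True` is one way to keep such values. A sketch with made-up distances:

```python
import pandas as pd

distances = pd.Series([0.0, 1.5, 5.0, 20.0])   # made-up values
edges = [distances.min(), 2, 10, distances.max()]
labels = ['short', 'medium', 'long']

# Default: the first interval is (0, 2], so 0.0 itself is not binned
default_cut = pd.cut(distances, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(distances, bins=edges, labels=labels,
                       include_lowest=True)

print(default_cut.isna().sum(), inclusive_cut.isna().sum())
```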

# Beyond 3

What happens if we don't pass explicit intervals, and instead ask `pd.cut` to just create 3 bins, with `bins=3`?

In [4]:
passenger_count = pd.read_csv('../data/taxi-passenger-count.csv', header=None).squeeze()

pd.cut(s[passenger_count == 1], 
       bins=3,
       labels=['short', 'medium', 'long'], retbins=True)

(0       short
 1       short
 2       short
 3       short
 4       short
         ...  
 9993    short
 9994    short
 9995    short
 9996    short
 9998    short
 Name: 0, Length: 7207, dtype: category
 Categories (3, object): ['short' < 'medium' < 'long'],
 array([-0.0646    , 21.53333333, 43.06666667, 64.6       ]))

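`pd.cut` took the interval from the smallest to the largest value in the data it was given (here 0 to 64.6, the same as `s.min()` and `s.max()`), divided it into three equal-width parts, and assigned those to be `short`, `medium`, and `long`. (The lowest edge is nudged slightly below the minimum, to -0.0646, so that the minimum itself is included.) We can see, though, that this means our `long` category runs from about 43 miles to 64.6 miles: numerically one-third of the values' interval, but including only a handful of values!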
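If the goal is bins holding roughly the same number of trips rather than bins of the same width, `pd.qcut` is a possible alternative, since it places the edges at quantiles of the data. A sketch on made-up, skewed distances (not the taxi data):

```python
import pandas as pd

# Made-up distances with one very long trip
distances = pd.Series([0.5, 1.0, 1.2, 1.7, 2.5, 3.3, 4.0, 8.0, 64.6])
labels = ['short', 'medium', 'long']

# Equal-width bins: almost everything lands in 'short'
equal_width = pd.cut(distances, bins=3, labels=labels)

# Quantile-based bins: each label gets about a third of the values
equal_count = pd.qcut(distances, q=3, labels=labels)

print(equal_width.value_counts())
print(equal_count.value_counts())
```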