## Alternative solutions and hints:
### more readable
> 10, 18, 23, 24, 25
### more efficient (vectorized)
> 16, 19, 24
### when to use it?
> 13, 26
### hints
> 12, 18, 22, 24

In [1]:
import pandas as pd

In [2]:
# 1. How to import pandas and check the version? 
print(pd.__version__)
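# show_versions() prints its report and returns None, so the outer print() adds the trailing "None"
print(pd.show_versions(as_json=True))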

0.24.2
{'system': {'commit': None, 'python': '3.6.5.final.0', 'python-bits': 64, 'OS': 'Darwin', 'OS-release': '16.7.0', 'machine': 'x86_64', 'processor': 'i386', 'byteorder': 'little', 'LC_ALL': 'None', 'LANG': 'zh_TW.UTF-8', 'LOCALE': 'zh_TW.UTF-8'}, 'dependencies': {'pandas': '0.24.2', 'pytest': None, 'pip': '9.0.3', 'setuptools': '39.0.1', 'Cython': None, 'numpy': '1.16.4', 'scipy': '1.3.0', 'pyarrow': None, 'xarray': None, 'IPython': '7.5.0', 'sphinx': None, 'patsy': None, 'dateutil': '2.8.0', 'pytz': '2019.1', 'blosc': None, 'bottleneck': None, 'tables': None, 'numexpr': None, 'feather': None, 'matplotlib': '3.1.1', 'openpyxl': None, 'xlrd': None, 'xlwt': None, 'xlsxwriter': None, 'lxml.etree': None, 'bs4': None, 'html5lib': None, 'sqlalchemy': None, 'pymysql': None, 'psycopg2': None, 'jinja2': '2.10.1', 's3fs': None, 'fastparquet': None, 'pandas_gbq': None, 'pandas_datareader': None, 'gcsfs': None}}
None


In [3]:
# 2. How to create a series from a list, numpy array and dict?
import numpy as np
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))

ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)
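
The three constructors differ mainly in the index they produce. A minimal sketch, using small illustrative values rather than the data from the cell above:

```python
import numpy as np
import pandas as pd

letters = list('abc')
numbers = np.arange(3)

from_list = pd.Series(letters)                       # default integer index 0..2
from_array = pd.Series(numbers)                      # default integer index 0..2
from_dict = pd.Series(dict(zip(letters, numbers)))   # dict keys become the index

print(from_list.index.tolist())   # [0, 1, 2]
print(from_dict.index.tolist())   # ['a', 'b', 'c']
```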

In [4]:
# 3. How to convert the index of a series into a column of a dataframe?
# L1
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)

df = ser.to_frame().reset_index()
df.head()

  index  0
0     a  0
1     b  1
2     c  2
3     e  3
4     d  4


In [5]:
# 4. How to combine many series to form a dataframe?
# L1
import numpy as np
ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))
ser2 = pd.Series(np.arange(26))

df = pd.concat([ser1, ser2], axis=1)
df.head()

   0  1
0  a  0
1  b  1
2  c  2
3  e  3
4  d  4


In [6]:
# 5. How to assign name to the series’ index?
# L1
ser = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))

# ser.name labels the series itself; the index has its own label, ser.index.name
ser.name = 'alphabets'
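
A series carries two separate labels: one for its values and one for its index. The sketch below sets both on illustrative data (the label `'position'` is a made-up name) and shows where each ends up after `reset_index()`:

```python
import pandas as pd

ser = pd.Series(list('abc'))
ser.name = 'alphabets'        # labels the values (becomes the column name)
ser.index.name = 'position'   # labels the index itself

df = ser.reset_index()
print(df.columns.tolist())    # ['position', 'alphabets']
```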

In [7]:
# 6. How to get the items of series A not present in series B?
# L2
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

mask = ~ ser1.isin(ser2)
ser1[mask]

0    1
1    2
2    3
dtype: int64

In [8]:
# 7. How to get the items not common to both series A and series B?
# L2
ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

union_ser = pd.Series(np.union1d(ser1, ser2))
intersection_ser = pd.Series(np.intersect1d(ser1, ser2))
xor_ser = union_ser[~ union_ser.isin(intersection_ser)]
xor_ser

0    1
1    2
2    3
5    6
6    7
7    8
dtype: int64
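
NumPy also has a direct helper for this symmetric difference. A short sketch with the same two series; note that `np.setxor1d` returns a sorted array of unique values, so the original index labels are not kept:

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Values present in exactly one of the two inputs
xor_values = np.setxor1d(ser1.values, ser2.values)
print(xor_values.tolist())   # [1, 2, 3, 6, 7, 8]
```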

In [9]:
# 8. How to get the minimum, 25th percentile, median, 75th, and max of a numeric series?
# L2
ser = pd.Series(np.random.normal(10, 5, 25))
ser.quantile([.0, .25, .5, .75, 1])

0.00    -2.816515
0.25     3.814005
0.50     7.848669
0.75    13.588815
1.00    22.202684
dtype: float64

In [10]:
# 9. How to get frequency counts of unique items of a series?
#  L1
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))
ser.value_counts()

b    8
a    5
d    4
e    4
f    3
c    3
g    2
h    1
dtype: int64

In [11]:
# 10. How to keep only top 2 most frequent values as it is and replace everything else as ‘Other’?
# L2
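# Note: this line builds a RandomState and discards it, so it does not seed the
# np.random call below; use np.random.seed(100) for a reproducible series
np.random.RandomState(100)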
ser = pd.Series(np.random.randint(1, 5, [12]))

# ALTERNATIVE ANSWER
# More Readable
top_2_frequent = ser.value_counts().nlargest(2).index
ser.where(ser.isin(top_2_frequent), other='Other')

0         4
1     Other
2         4
3         4
4         4
5     Other
6         4
7         3
8     Other
9         4
10        3
11    Other
dtype: object
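
To make this exercise reproducible, one option is to keep the `RandomState` object and draw from it directly. A sketch under that assumption; the helper name `top2_or_other` is hypothetical:

```python
import numpy as np
import pandas as pd

def top2_or_other(seed):
    """Keep the two most frequent values and replace the rest with 'Other'."""
    rng = np.random.RandomState(seed)          # keep the generator and draw from it
    ser = pd.Series(rng.randint(1, 5, 12))
    top_2 = ser.value_counts().nlargest(2).index
    return ser.where(ser.isin(top_2), other='Other')

first = top2_or_other(100)
second = top2_or_other(100)
print(first.tolist() == second.tolist())       # True: same seed, same result
```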

In [12]:
#  11. How to bin a numeric series to 10 groups of equal size?
#  L2
ser = pd.Series(np.random.random(20))
# Note: the output dtype is category
pd.qcut(ser,
        q = [0, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1],
        labels=['1st', '2nd','3rd','4th','5th',
                 '6th', '7th', '8th','9th', '10th'])

0      5th
1      4th
2      9th
3      1st
4      6th
5      3rd
6      6th
7      2nd
8      8th
9      4th
10     3rd
11     1st
12    10th
13     7th
14     2nd
15     5th
16    10th
17     9th
18     8th
19     7th
dtype: category
Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]
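
When the labels themselves do not matter, `pd.qcut` can take the number of bins directly. A small sketch on illustrative data (20 distinct values) that returns integer bin codes instead of categories:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(20, dtype=float))   # illustrative data, 20 distinct values

# q=10 asks for deciles; labels=False returns bin codes 0..9
deciles = pd.qcut(ser, q=10, labels=False)
print(deciles.value_counts().sort_index().tolist())   # [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
```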

In [13]:
# 12. How to convert a numpy array to a dataframe of given shape? (L1)
# L1
ser = pd.Series(np.random.randint(1, 10, 35))

# Note: -1 tells NumPy to infer the remaining dimension;
# here (7, -1) becomes (7, 35 / 7) = (7, 5)
pd.DataFrame(ser.values.reshape(7,-1))

   0  1  2  3  4
0  2  9  7  9  9
1  3  1  2  9  7
2  9  2  7  3  7
3  5  1  5  7  2
4  1  8  8  6  4
5  2  1  2  9  2
6  1  3  6  5  3


In [14]:
# 13. How to find the positions of numbers that are multiples of 3 from a series?
# L2
ser = pd.Series(np.random.randint(1, 10, 7))

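# Note: Series.where and three-argument np.where return a full-length result,
#       while np.argwhere returns only the positions where the condition holds.
#       To use the positions: arr = np.argwhere(condition), then ser.iloc[arr.reshape(-1)]
# Passing .values (a plain NumPy array) avoids the warning this pandas version
# emits when a Series is handed to np.argwhere
np.argwhere((ser % 3 == 0).values)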

array([[4],
       [5]])
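
If a flat list of positions is more convenient than the column-shaped result of `np.argwhere`, `np.flatnonzero` is one alternative. A sketch on fixed illustrative data:

```python
import numpy as np
import pandas as pd

ser = pd.Series([7, 3, 8, 9, 1, 6, 4])     # fixed illustrative data

mask = (ser % 3 == 0).values               # plain boolean NumPy array
positions = np.flatnonzero(mask)           # 1-D array of integer positions
print(positions.tolist())                  # [1, 3, 5]
print(ser.iloc[positions].tolist())        # [3, 9, 6]
```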

In [15]:
# 14. How to extract items at given positions from a series
# L1
ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

ser.iloc[pos]

0     a
4     e
8     i
14    o
20    u
dtype: object

In [16]:
# 15. How to stack two series vertically and horizontally ?
# Difficulty Level: L1
ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))

df1 = pd.concat([ser1,ser2], axis=0).to_frame()
df2 = pd.concat([ser1,ser2], axis=1)

In [17]:
# 16. How to get the positions of items of series A in another series B?
# Difficulty Level: L2
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# Note: this solution is vectorized, so it is faster than a list comprehension
# over ser2 when the data is large. The positions come back in ser1 order.
np.argwhere(ser1.isin(ser2).values).reshape(-1).tolist()

[0, 4, 5, 8]
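
The `isin` approach reports positions in the order they occur in `ser1`. If the positions should follow the order of `ser2` instead, `Index.get_indexer` is one vectorized option. This sketch assumes the values of `ser1` are unique:

```python
import pandas as pd

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# Assumes unique values in ser1; get_indexer returns -1 for values that are not found
positions = pd.Index(ser1.values).get_indexer(ser2.values)
print(positions.tolist())   # [5, 4, 0, 8]
```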

In [18]:
# 17. How to compute the mean squared error on a truth and predicted series?
# Difficulty Level: L2
truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)

print((truth - pred).pow(2).sum() / len(truth))
print(np.mean((truth - pred) ** 2))

0.262621875162554
0.262621875162554


In [19]:
# 18. How to convert the first character of each element in a series to uppercase?
# Difficulty Level: L2
ser = pd.Series(['how', 'to', 'kick', 'ass?'])

# More readable
ser.str.capitalize()
# Hint:
# dir(ser.str) lists every method available on the .str accessor,
# which is a quick way to discover what ser.str can do
# print(dir(ser.str))

0     How
1      To
2    Kick
3    Ass?
dtype: object
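
One detail worth knowing: `str.capitalize()` also lower-cases everything after the first character. The sketch below, on illustrative mixed-case data, compares it with a slicing approach that changes only the first character:

```python
import pandas as pd

ser = pd.Series(['how', 'tO', 'kick IT?'])   # illustrative mixed-case data

capitalized = ser.str.capitalize()                     # rest of each string is lower-cased
first_only = ser.str[:1].str.upper() + ser.str[1:]     # only the first character changes

print(capitalized.tolist())   # ['How', 'To', 'Kick it?']
print(first_only.tolist())    # ['How', 'TO', 'Kick IT?']
```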

In [20]:
# 19. How to calculate the number of characters in each word in a series?
# Difficulty Level: L2
ser = pd.Series(['how', 'to', 'kick', 'ass?'])

# vectorized

ser.str.len()

0    3
1    2
2    4
3    4
dtype: int64

In [21]:
# 20. How to compute difference of differences between consecutive numbers of a series?
# Difficulty Level: L1
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# Solution 1
tmp = ser.shift(1)
middle_result = ser - tmp
print(middle_result.tolist())
tmp2 = middle_result.shift(1)
print((middle_result - tmp2).tolist())

# Solution 2
print('-'*60)
print(ser.diff().tolist())
print(ser.diff().diff().tolist())

[nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]
[nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]
------------------------------------------------------------
[nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]
[nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]
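
For comparison, NumPy can take the second difference in one call. Unlike `Series.diff()`, `np.diff` does not pad with NaN, so the result is two elements shorter than the input:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

second_diff = np.diff(ser.values, n=2)   # n=2 applies the difference twice
print(second_diff.tolist())              # [1, 1, 1, 1, 0, 2]
```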


In [26]:
# 21. How to convert a series of date-strings to a timeseries?
# Difficulty Level: L2

ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

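# infer_datetime_format works on this pandas version (0.24.2); it is deprecated from pandas 2.0,
# where mixed formats like these need format='mixed' instead
pd.to_datetime(ser, infer_datetime_format=True)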

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
dtype: datetime64[ns]

In [75]:
# 22. How to get the day of month, week number, day of year and day of week from a series of date strings?
# Difficulty Level: L2

ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

tmp = pd.to_datetime(ser)
# Hint:
# dir(tmp.dt) lists what can be called on the accessor;
# tmp.dt is a pandas.core.indexes.accessors.DatetimeProperties object.
# dir(tmp[0]) does the same for a single element;
# tmp[0] is a <class 'pandas._libs.tslibs.timestamps.Timestamp'>.
# Timestamp documentation:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html
# dateutil is also very handy for working with dates and times:
# https://dateutil.readthedocs.io/en/stable/index.html


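Date = tmp.dt.day.tolist()
# weekofyear works on this pandas version (0.24.2); pandas 1.1+ offers tmp.dt.isocalendar().week
Week_number = tmp.dt.weekofyear.tolist()
Day_num_of_year = tmp.dt.dayofyear.tolist()
# day_name() replaces the deprecated weekday_name attribute
Dayofweek = tmp.dt.day_name().tolist()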

print(f''' 
Date : {Date}
Week number : {Week_number}
Day num of year : {Day_num_of_year}
Day of week : {Dayofweek}
''')


 
Date : [1, 2, 3, 4, 5, 6]
Week number : [53, 5, 9, 14, 19, 23]
Day num of year : [1, 33, 63, 94, 125, 157]
Day of week : ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']



In [76]:
# 23. How to convert year-month string to dates corresponding to the 4th day of the month?
# Difficulty Level: L2
ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])

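# ALTERNATIVE SOLUTION
# more readable
# Parsing 'Jan 2010' yields the 1st of the month, so add 3 days to reach the 4th
pd.to_datetime(ser, infer_datetime_format=True) + pd.Timedelta(days=3)

0   2010-01-04
1   2011-02-04
2   2012-03-04
dtype: datetime64[ns]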

In [86]:
# 24. How to filter words that contain at least 2 vowels from a series?
# Difficulty Level: L3

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# ALTERNATIVE SOLUTION
# more readable
# vectorized
condition = ser.str.count('[aeiouAEIOU]') >= 2
ser[condition]
# Hint: print(dir(ser.str)) shows what can be called on the .str accessor


0     Apple
1    Orange
4     Money
dtype: object

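* hint
    * <img src = "./RegExp_snap.png"></img>
    * The `Series.str` methods in pandas are vectorized, and most of them support regular expressions (RegExp).
    * Regular expressions help with many text-processing problems.
    * For example, combining the character class [aeiou] from the image with [AEIOU] gives [aeiouAEIOU], which solves this exercise.
    * You may also want to read [this tutorial on Runoob (菜鳥教程)](http://www.runoob.com/python/python-reg-expressions.html).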
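
Instead of listing both cases in the character class, the match can be made case-insensitive with a regex flag. A sketch of that variant on the same data:

```python
import re
import pandas as pd

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# flags=re.IGNORECASE lets [aeiou] match upper-case vowels too
vowel_counts = ser.str.count('[aeiou]', flags=re.IGNORECASE)
print(ser[vowel_counts >= 2].tolist())   # ['Apple', 'Orange', 'Money']
```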

In [89]:
# 25. How to filter valid emails from a series?
# Difficulty Level: L3

# Extract the valid emails from the series emails. The regex pattern for valid emails is provided as reference.

emails = pd.Series(['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com'])
pattern ='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'

# More readable
condition = emails.str.contains(pattern)
emails[condition].tolist()

['rameses@egypt.com', 'matt@t.co', 'narendra@modi.com']
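
Note that `str.contains` is a substring search, so a sentence that merely includes an address also passes. The sketch below, on illustrative strings, contrasts it with an anchored pattern and with `str.findall`, which pulls the addresses out of the text:

```python
import pandas as pd

pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}'
texts = pd.Series(['matt@t.co', 'write to matt@t.co today', 'no address here'])

contains_email = texts.str.contains(pattern)               # substring search
is_only_email = texts.str.contains('^' + pattern + '$')    # whole string must match
extracted = texts.str.findall(pattern)                     # list of matches per element

print(contains_email.tolist())   # [True, True, False]
print(is_only_email.tolist())    # [True, False, False]
print(extracted.tolist())        # [['matt@t.co'], ['matt@t.co'], []]
```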

In [106]:
# 26. How to get the mean of a series grouped by another series?
# Difficulty Level: L2
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))
print(weights.tolist())
print(fruit.tolist())

print()
print(weights.groupby(fruit).mean())

# ALTERNATIVE SOLUTION
tmp = pd.DataFrame({'fruit':fruit, 
              'weights':weights}).groupby('fruit').mean()

tmp['weights'].index.name = ''
print(tmp['weights'])

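# When to use?
# When working with a DataFrame we usually call groupby on the DataFrame itself.
# This exercise shows that a Series can be grouped by another Series, with the two
# matched through their index, which gives more flexibility during feature engineering.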

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
['apple', 'apple', 'banana', 'apple', 'apple', 'apple', 'apple', 'apple', 'banana', 'banana']

apple     4.714286
banana    7.333333
dtype: float64

apple     4.714286
banana    7.333333
Name: weights, dtype: float64
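
The index matching is easiest to see when the two series carry the same labels in a different order. A small sketch with illustrative values:

```python
import pandas as pd

weights = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
fruit = pd.Series(['apple', 'banana', 'apple'], index=[2, 0, 1])   # same labels, different order

# Grouping keys are aligned to weights by index label, not by position:
# label 0 -> banana, labels 1 and 2 -> apple
means = weights.groupby(fruit).mean()
print(means.to_dict())   # {'apple': 2.5, 'banana': 1.0}
```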
