### Want to read a HUGE dataset into pandas but don't have enough memory?
##### Randomly sample the dataset *during file reading* by passing a function to "skiprows


In [112]:
import pandas as pd
import numpy as np

In [113]:
df = pd.read_csv('survey_results_public.csv')
df.shape

(88883, 85)

### How it works:
  - <b>skiprows</b> accepts a function that is evaluated aganist the integer index 
  - <b>x > 0</b> ensures that the header row is <b>not</b> skipped
  - <b>np.random.rand()>0.01 </b> return <b>True</b> 99% of the time,thus skipping 99% of the rows

In [114]:
df = pd.read_csv('survey_results_public.csv',skiprows = lambda x:x > 0 and np.random.rand() > 0.01)
df.shape

(887, 85)

In [115]:

%%html
<style type ="text/css">
table.dataframe th ,table.dataframe td
{
    border:1px;
    border-style:solid;
}

In [116]:
df.head()

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,213,I am a developer by profession,Yes,Never,The quality of OSS and closed source software ...,Employed full-time,United Arab Emirates,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,41.0,Man,No,Straight / Heterosexual,Middle Eastern,Yes,Too long,Easy
1,255,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Independent contractor, freelancer, or self-em...",Thailand,"Yes, part-time","Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,20.0,Man,Yes,Straight / Heterosexual,East Asian,Yes,Too long,Neither easy nor difficult
2,312,I am a developer by profession,No,Never,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)",Mathematics or statistics,...,Somewhat more welcome now than last year,Tech articles written by other developers,31.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
3,373,I am a developer by profession,Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Brazil,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,37.0,Man,No,Straight / Heterosexual,Hispanic or Latino/Latina;White or of European...,No,Appropriate in length,Easy
4,507,"I am not primarily a developer, but I write co...",Yes,Once a month or more often,The quality of OSS and closed source software ...,"Independent contractor, freelancer, or self-em...",Italy,"Yes, full-time","Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Somewhat more welcome now than last year,,24.0,Man,No,,White or of European descent,No,Appropriate in length,Neither easy nor difficult


### Do you sometimes end up with an "Unnamed: 0" column in your DataFrame? 
#### Solution: Set the first column as the index (when reading)
#### Alternative: Don't save the index to the file (when writing)

In [117]:
dummy = pd.DataFrame({'A':[0.0,1.0,2.0],'B':[0.0,1.0,0.0],'C':['foo1','foo2','foo3']})
dummy

Unnamed: 0,A,B,C
0,0.0,0.0,foo1
1,1.0,1.0,foo2
2,2.0,0.0,foo3


In [118]:
dummy.to_csv('file.csv')

In [119]:
pd.read_csv('file.csv')


Unnamed: 0.1,Unnamed: 0,A,B,C
0,0,0.0,0.0,foo1
1,1,1.0,1.0,foo2
2,2,2.0,0.0,foo3


In [120]:
#Solution :Set the first column as the index
pd.read_csv('file.csv',index_col=0)

Unnamed: 0,A,B,C
0,0.0,0.0,foo1
1,1.0,1.0,foo2
2,2.0,0.0,foo3


In [121]:
##Alterntaive:Don't write the index to the file
df.to_csv('file1.csv',index=False)
pd.read_csv('file1.csv')

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,213,I am a developer by profession,Yes,Never,The quality of OSS and closed source software ...,Employed full-time,United Arab Emirates,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,41.0,Man,No,Straight / Heterosexual,Middle Eastern,Yes,Too long,Easy
1,255,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Independent contractor, freelancer, or self-em...",Thailand,"Yes, part-time","Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,20.0,Man,Yes,Straight / Heterosexual,East Asian,Yes,Too long,Neither easy nor difficult
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
885,40556,,Yes,Never,The quality of OSS and closed source software ...,"Not employed, but looking for work",United States,"Yes, part-time",Some college/university study without earning ...,"Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Tech...,35.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
886,49705,,Yes,Once a month or more often,The quality of OSS and closed source software ...,Employed part-time,United States,"Yes, full-time",Some college/university study without earning ...,"Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,21.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy


### Problem: Your DataFrame is in "wide format" (lots of columns), but you need it in "long format" (lots of rows)
#### Solution: Use melt()

In [122]:
#wide format
df = pd.DataFrame({'zip_code':[12345,34566,98745],'factory':[100,200,300],'warehouse':[100,200,300],'retail':[100,200,300]})
df

Unnamed: 0,zip_code,factory,warehouse,retail
0,12345,100,100,100
1,34566,200,200,200
2,98745,300,300,300


In [123]:
#long format
df.melt(id_vars='zip_code',
       var_name='location_type',
       value_name='distance')

Unnamed: 0,zip_code,location_type,distance
0,12345,factory,100
1,34566,factory,200
2,98745,factory,300
3,12345,warehouse,100
4,34566,warehouse,200
5,98745,warehouse,300
6,12345,retail,100
7,34566,retail,200
8,98745,retail,300


### Want to convert "year" and "day of year" into a single datetime column? 
1. Combine them into one number
2. Convert to datetime and specify its format

In [124]:
df = pd.DataFrame({'year':[2019,2019,2020],'day_of_year':[350,365,1]})

In [125]:
df

Unnamed: 0,year,day_of_year
0,2019,350
1,2019,365
2,2020,1


In [126]:
#step 1:

df['combined'] = df['year'] * 1000  + df['day_of_year']
df

Unnamed: 0,year,day_of_year,combined
0,2019,350,2019350
1,2019,365,2019365
2,2020,1,2020001


In [127]:
##step 2:
df['date'] = pd.to_datetime(df['combined'], format ='%Y%j')
df

Unnamed: 0,year,day_of_year,combined,date
0,2019,350,2019350,2019-12-16
1,2019,365,2019365,2019-12-31
2,2020,1,2020001,2020-01-01


### Want to create interactive plots using pandas 0.25? 📊
1. Pick one:
➡️ pip install hvplot
➡️ conda install -c conda-forge hvplot

2. pd.options.plotting.backend = 'hvplot'
3. df.plot(...)

#### Want to know the *count* of missing values in a DataFrame?
➡️ df.isna().sum().sum()

Just want to know if there are *any* missing values?
➡️ df.isna().any().any()
➡️ df.isna().any(axis=None)

In [128]:
df =pd.read_csv('http://bit.ly/imdbratings')
df.head(2)

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"


In [129]:
#count of missing values in *each* column
df.isna().sum()

star_rating       0
title             0
content_rating    3
genre             0
duration          0
actors_list       0
dtype: int64

In [130]:
#count of missing values  *total*
df.isna().sum().sum()

3

In [131]:
## are there missing values in each column?
df.isna().any()


star_rating       False
title             False
content_rating     True
genre             False
duration          False
actors_list       False
dtype: bool

In [132]:
## are there missing values in any column?
df.isna().any().any()

True

In [133]:
## Alternative solutions
df.isna().any(axis=None)


True

### Want to save a *massive* amount of memory? Fix your data types:
#### ➡️ 'int8' for small integers
#### ➡️ 'category' for strings with few unique values
#### ➡️ 'Sparse' if most values are 0 or NaN

In [134]:
df1 = pd.DataFrame({'Pclass':[1,3,3,1,1],'sex':['male','female','female','female','male'],'Parch':[0,0,0,0,0],'Cabin':['NaN','C85','NaN','C123','NaN']})

In [135]:
df1

Unnamed: 0,Pclass,sex,Parch,Cabin
0,1,male,0,
1,3,female,0,C85
2,3,female,0,
3,1,female,0,C123
4,1,male,0,


In [136]:
df1.memory_usage(deep=True)

Index     128
Pclass     40
sex       311
Parch      40
Cabin     301
dtype: int64

In [137]:
df1 =df1.astype({'Pclass':'int8',   #only values are 1/2/3
              'sex':'category',   #only values are male/female
              'Parch':'Sparse[int]', #most values are 0
              'Cabin':'Sparse[str]'}) #most values are NaN

In [138]:
df1.memory_usage(deep=True)

Index     128
Pclass      5
sex       209
Parch       0
Cabin     321
dtype: int64

### Want to combine the small categories in a Series (<10% frequency) into a single category?
1. Save the normalized value counts
2. Filter by frequency & save the index
3. Replace small categories with "Other"

In [139]:
df.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [140]:
#step:1
frequency = df['genre'].value_counts(normalize=True)
frequency

Drama        0.283963
Comedy       0.159346
Action       0.138917
Crime        0.126660
Biography    0.078652
Adventure    0.076609
Animation    0.063330
Horror       0.029622
Mystery      0.016343
Western      0.009193
Thriller     0.005107
Sci-Fi       0.005107
Film-Noir    0.003064
Family       0.002043
Fantasy      0.001021
History      0.001021
Name: genre, dtype: float64

In [141]:
#step 2:
small_categories = frequency[frequency < 0.10].index
small_categories

Index(['Biography', 'Adventure', 'Animation', 'Horror', 'Mystery', 'Western',
       'Thriller', 'Sci-Fi', 'Film-Noir', 'Family', 'Fantasy', 'History'],
      dtype='object')

In [142]:
#step 3:
genre_updated = df['genre'].replace(small_categories ,'Others')

In [143]:
#result
genre_updated.value_counts(normalize=True)

Others    0.291113
Drama     0.283963
Comedy    0.159346
Action    0.138917
Crime     0.126660
Name: genre, dtype: float64

### Need to clean an object column with mixed data types? Use "replace" (not str.replace) and regex!

In [144]:
df2 = pd.DataFrame({'Customer':['A','B','C','D'],'sales':[1100,950.75,'$800.00','$1,250.25']})

In [145]:
#mixed data types in the sales column
df2['sales'].apply(type)

0      <class 'int'>
1    <class 'float'>
2      <class 'str'>
3      <class 'str'>
Name: sales, dtype: object

In [146]:
#replace dollar sign or comma witj an empty string
df2['sales'] = df2['sales'].replace('[$,]','',regex=True).astype('float')

In [147]:
df2

Unnamed: 0,Customer,sales
0,A,1100.0
1,B,950.75
2,C,800.0
3,D,1250.25


In [148]:
df2['sales'].apply(type)

0    <class 'float'>
1    <class 'float'>
2    <class 'float'>
3    <class 'float'>
Name: sales, dtype: object

### Need to create a time series dataset for testing? Use pd.util.testing.makeTimeDataFrame()
#### Need more control over the columns & data? Generate data with np.random & overwrite index with makeDateIndex()

In [149]:
num_rows = 366 * 24 #hours in a leap yeat
pd.util.testing.makeTimeDataFrame(num_rows,freq='H')

Unnamed: 0,A,B,C,D
2000-01-01 00:00:00,0.384156,0.854345,-0.312709,-0.837820
2000-01-01 01:00:00,0.295438,-0.833476,1.383553,2.599314
...,...,...,...,...
2000-12-31 22:00:00,0.122008,-1.187613,1.060822,-1.619654
2000-12-31 23:00:00,0.962942,0.501121,-1.290945,-0.648113


In [150]:
num_cols = 2
cols =['sales','customers']
df3 = pd.DataFrame(np.random.randint(1,20,size=(num_rows,num_cols)),columns=cols)
df3.index = pd.util.testing.makeDateIndex(num_rows,freq='H')
df3

Unnamed: 0,sales,customers
2000-01-01 00:00:00,14,18
2000-01-01 01:00:00,15,8
...,...,...
2000-12-31 22:00:00,6,6
2000-12-31 23:00:00,19,2


### Want to insert a new column into a DataFrame at a specific location? Use the "insert" method:
#### df.insert(location, name, value)


In [151]:
df4 = pd.DataFrame({'A':[15,16],'B':[59,33],'C':[22,44],'D':[33,66],'E':[13,33],'F':[22,10]})

In [152]:
df4

Unnamed: 0,A,B,C,D,E,F
0,15,59,22,33,13,22
1,16,33,44,66,33,10


### insert a new column at a specific location

In [153]:
df4.insert(3,'C2',df4['C'] * 2)
df4

Unnamed: 0,A,B,C,C2,D,E,F
0,15,59,22,44,33,13,22
1,16,33,44,88,66,33,10


### Alternative:Create a new columns and then move

In [154]:
df4['C3'] = df4['C'] * 3
cols = list(df4.columns)
location = 4
cols = cols[:location] + ['C3'] + cols[location:-1]
df4 = df4[cols]
df4

Unnamed: 0,A,B,C,C2,C3,D,E,F
0,15,59,22,44,66,33,13,22
1,16,33,44,88,132,66,33,10


### Need to split names of variable length into first_name & last_name?
#### 1. Use str.split(n=1) to split only once (returns a Series of lists)
#### 2. Chain str[0] and str[1] on the end to select the list elements

In [155]:
df5 = pd.DataFrame({'name':['Geordi La Forge','Deanna Troi','Data']})

In [156]:
df5

Unnamed: 0,name
0,Geordi La Forge
1,Deanna Troi
2,Data


In [157]:
df5['first_name'] = df5['name'].str.split(n=1).str[0]
df5['last_name'] = df5['name'].str.split(n=1).str[1]

In [158]:
df5['first_name']

0    Geordi
1    Deanna
2      Data
Name: first_name, dtype: object

In [159]:
df5['last_name']

0    La Forge
1        Troi
2         NaN
Name: last_name, dtype: object

In [160]:
df5

Unnamed: 0,name,first_name,last_name
0,Geordi La Forge,Geordi,La Forge
1,Deanna Troi,Deanna,Troi
2,Data,Data,


### Goal: Rearrange the columns in your DataFrame
#### Options:
1. Specify all column names in desired order
2. Specify columns to move, followed by remaining columns
3. Specify column positions in desired order

In [161]:
df6 = pd.DataFrame(np.random.rand(2,8),columns=list('abcdefgh'))
df6

Unnamed: 0,a,b,c,d,e,f,g,h
0,0.773016,0.14707,0.789791,0.115033,0.149785,0.413434,0.896309,0.540719
1,0.06664,0.986582,0.776251,0.441787,0.585847,0.812809,0.128695,0.477103


### option 1:useful for dataFrames with few columns

In [162]:
cols = ['a','c','e','b','d','f','g','h']
df6[cols]

Unnamed: 0,a,c,e,b,d,f,g,h
0,0.773016,0.789791,0.149785,0.14707,0.115033,0.413434,0.896309,0.540719
1,0.06664,0.776251,0.585847,0.986582,0.441787,0.812809,0.128695,0.477103


### option 2:useful for dataFrames with many columns

In [163]:
cols_to_move = ['a','c','e']
cols = cols_to_move + [col for col in df6 if col not in cols_to_move]
df6[cols]

Unnamed: 0,a,c,e,b,d,f,g,h
0,0.773016,0.789791,0.149785,0.14707,0.115033,0.413434,0.896309,0.540719
1,0.06664,0.776251,0.585847,0.986582,0.441787,0.812809,0.128695,0.477103


### option3:useful for long column names

In [164]:
cols = df6.columns[[0,2,4,1,3,5,6,7]]
df6[cols]

Unnamed: 0,a,c,e,b,d,f,g,h
0,0.773016,0.789791,0.149785,0.14707,0.115033,0.413434,0.896309,0.540719
1,0.06664,0.776251,0.585847,0.986582,0.441787,0.812809,0.128695,0.477103


### Problem: You have time series data that you want to aggregate by day, but you're only interested in weekends.
#### Solution:
1. resample by day ('D')
2. filter by day of week (5=Saturday, 6=Sunday)

In [165]:
num_cols = 2
cols =['hourly_sales','customers']
df3 = pd.DataFrame(np.random.randint(1,20,size=(num_rows,num_cols)),columns=cols)
df3.index = pd.util.testing.makeDateIndex(num_rows,freq='H')
df3

Unnamed: 0,hourly_sales,customers
2000-01-01 00:00:00,3,2
2000-01-01 01:00:00,17,6
...,...,...
2000-12-31 22:00:00,9,19
2000-12-31 23:00:00,2,11


In [166]:
daily_sales = df3.resample('D').hourly_sales.sum()
daily_sales

2000-01-01    298
2000-01-02    201
             ... 
2000-12-30    226
2000-12-31    239
Freq: D, Name: hourly_sales, Length: 366, dtype: int32

In [167]:
weekend_sales = daily_sales[daily_sales.index.dayofweek.isin([5,6])]
weekend_sales

2000-01-01    298
2000-01-02    201
             ... 
2000-12-30    226
2000-12-31    239
Name: hourly_sales, Length: 106, dtype: int32

### Are you applying multiple aggregations after a groupby? Try "named aggregation":
#### ✅ Allows you to name the output columns
#### ❌ Avoids a column MultiIndex

In [168]:
titanic = pd.read_csv('http://bit.ly/kaggletrain')

### Problem :Uninformative column names

In [169]:
titanic.head(1)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S


In [170]:
titanic.groupby('Pclass').Age.agg(['mean','max'])

Unnamed: 0_level_0,mean,max
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,38.233441,80.0
2,29.87763,70.0
3,25.14062,74.0


### Problem :Mutiindex in the columsn

In [171]:
titanic.groupby('Pclass').agg({'Age':['mean','max']})

Unnamed: 0_level_0,Age,Age
Unnamed: 0_level_1,mean,max
Pclass,Unnamed: 1_level_2,Unnamed: 2_level_2
1,38.233441,80.0
2,29.87763,70.0
3,25.14062,74.0


#### Solution:'Named aggregation"(new in pandas 0.25)

In [172]:
titanic.groupby('Pclass').Age.agg(mean_age='mean',max_age='max')

Unnamed: 0_level_0,mean_age,max_age
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,38.233441,80.0
2,29.87763,70.0
3,25.14062,74.0


### Three useful ways to convert one set of values to another:
1. map() using a dictionary
2. factorize() to encode each value as an integer
3. comparison statement to return boolean values

In [173]:
df7 = pd.DataFrame({'gender':['male','female','male','female'],'color':['red','green','red','blue'],'age':[25,40,10,20]})
df7

Unnamed: 0,gender,color,age
0,male,red,25
1,female,green,40
2,male,red,10
3,female,blue,20


In [174]:
df7['gender_letter'] = df7['gender'].map({'male':'M','female':'F'})
df7['color_num'] = df7['color'].factorize()[0]
df7['can_vote'] = df7['age'] >=18

In [175]:
df7

Unnamed: 0,gender,color,age,gender_letter,color_num,can_vote
0,male,red,25,M,0,True
1,female,green,40,F,1,True
2,male,red,10,M,0,False
3,female,blue,20,F,2,True


### My favorite feature in pandas 0.25: If DataFrame has more than 60 rows, only show 10 rows (saves your screen space!)
#### You can modify this: pd.set_option('min_rows', 4)

In [176]:
#this trik requires pandas 0.25_1+
pd.__version__

'0.25.1'

In [177]:
#New: if DataFrame has more than 60 rows ,only show 10 rows(by default)
pd.DataFrame(np.random.rand(100,3))

Unnamed: 0,0,1,2
0,0.625166,0.446367,0.193629
1,0.365590,0.350564,0.846803
...,...,...,...
98,0.524523,0.008506,0.735020
99,0.332431,0.194477,0.919665


In [178]:
# #modify this optiojn to only 4 rows instead
pd.set_option('min_rows',4)




In [179]:
pd.DataFrame(np.random.rand(100,3))


Unnamed: 0,0,1,2
0,0.664173,0.055384,0.090080
1,0.444033,0.994652,0.310982
...,...,...,...
98,0.139897,0.689317,0.491630
99,0.751944,0.262472,0.541078


### Problem: Your dataset has many columns and you want to ensure that each one has the correct data type.
#### Solution:
1. Create a CSV of column names & dtypes
2. Read it into a DataFrame
3. Convert it to dictionary
4. Use the dictionary to specify dtypes of the dataset

In [180]:
#step1:Create a CSV of column names & dtypes
drinks = pd.read_csv('http://bit.ly/drinksbycountry')


In [181]:
drinks.columns

Index(['country', 'beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')

In [182]:
drinks.dtypes

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [183]:
df9 = pd.DataFrame(drinks.dtypes)

df9


# df9.to_csv('drinks_dtypes.csv',index_label=colu_name,columns=col_dtype)

Unnamed: 0,0
country,object
beer_servings,int64
spirit_servings,int64
wine_servings,int64
total_litres_of_pure_alcohol,float64
continent,object


In [184]:
!type drinks_dtypes.csv

col_name,col_dtype
country,object
beer_servings,int64
spirit_servings,int64
wine_servings,int64
total_litres_of_pure_alcohol,float64
continent,object


In [185]:
##step 2:read it into a Dataframe
dtypes = pd.read_csv('drinks_dtypes.csv').set_index('col_name')
dtypes

Unnamed: 0_level_0,col_dtype
col_name,Unnamed: 1_level_1
country,object
beer_servings,int64
spirit_servings,int64
wine_servings,int64
total_litres_of_pure_alcohol,float64
continent,object


In [186]:
##step 3:convert it to dicitonary
dtypes_dict = dtypes['col_dtype'].to_dict()
dtypes_dict

{'country': 'object',
 'beer_servings': 'int64',
 'spirit_servings': 'int64',
 'wine_servings': 'int64',
 'total_litres_of_pure_alcohol': 'float64',
 'continent': 'object'}

In [187]:
##step 4: Use the dictinary to sepicify the dtypes
drinks = pd.read_csv('http://bit.ly/drinksbycountry',dtype=dtypes_dict)
drinks.dtypes

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

### Want to select from a DataFrame by label *and* position?
#### Most readable approach is to chain "loc" (selection by label) and "iloc" (selection by position).

In [188]:
# drinks.set_index('country')
drinks.head()


Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In [189]:
drinks.iloc[15:20,:].loc[:,'beer_servings':'wine_servings']

Unnamed: 0,beer_servings,spirit_servings,wine_servings
15,142,373,42
16,295,84,212
17,263,114,8
18,34,4,13
19,23,0,0


### Does your object column contain mixed data types? Use df.col.apply(type).value_counts() to check!

In [190]:
df10 = pd.DataFrame({'customer':['A','B','C','D'],'sales':['100',150.75,'200','250.25']})

In [191]:
df10


Unnamed: 0,customer,sales
0,A,100.0
1,B,150.75
2,C,200.0
3,D,250.25


In [192]:
df10.dtypes

customer    object
sales       object
dtype: object

In [193]:
df10['sales'].apply(type)

0      <class 'str'>
1    <class 'float'>
2      <class 'str'>
3      <class 'str'>
Name: sales, dtype: object

In [194]:
df10['sales'].apply(type).value_counts()

<class 'str'>      3
<class 'float'>    1
Name: sales, dtype: int64

### Want to select multiple slices of columns from a DataFrame?
1. Use df.loc to select & pd.concat to combine
2. Slice df.columns & select using brackets
3. Use np.r_ to combine slices & df.iloc to select


In [195]:
df11 = pd.DataFrame(np.random.rand(3,11),columns=list('ABCDEFGHIJK'))
df11

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,K
0,0.766552,0.241195,0.626703,0.041254,0.502748,0.276639,0.538752,0.175962,0.172959,0.242691,0.199936
1,0.199139,0.159626,0.328952,0.334639,0.690515,0.343181,0.279096,0.924473,0.823501,0.034731,0.135809
2,0.763727,0.237788,0.595209,0.393916,0.461741,0.401821,0.914212,0.97559,0.999193,0.574706,0.211213


In [196]:
#option:1
# pd.concat([df11.loc[:,'A':'C'],df11.loc[:,'F'],df11.loc[:,'J':'K']],axis='columns')
pd.concat([df11.loc[:,'B':'D'],df11.loc[:,'G'],df11.loc[:,'A':'E']],axis='columns')

Unnamed: 0,B,C,D,G,A,B.1,C.1,D.1,E
0,0.241195,0.626703,0.041254,0.538752,0.766552,0.241195,0.626703,0.041254,0.502748
1,0.159626,0.328952,0.334639,0.279096,0.199139,0.159626,0.328952,0.334639,0.690515
2,0.237788,0.595209,0.393916,0.914212,0.763727,0.237788,0.595209,0.393916,0.461741


In [197]:
#option  2:
df11[list(df11.columns[0:3]) + list(df11.columns[5]) + list(df11.columns[9:11])]

Unnamed: 0,A,B,C,F,J,K
0,0.766552,0.241195,0.626703,0.276639,0.242691,0.199936
1,0.199139,0.159626,0.328952,0.343181,0.034731,0.135809
2,0.763727,0.237788,0.595209,0.401821,0.574706,0.211213


In [198]:
## option:3
df11.iloc[:,np.r_[0:3,5,9:11]]

Unnamed: 0,A,B,C,F,J,K
0,0.766552,0.241195,0.626703,0.276639,0.242691,0.199936
1,0.199139,0.159626,0.328952,0.343181,0.034731,0.135809
2,0.763727,0.237788,0.595209,0.401821,0.574706,0.211213


### Want to know the *count* of rows that match a condition? Use: (condition).sum()
Want to know the *percentage* of rows that match a condition? Use: (condition).mean()


In [199]:
movies = pd.read_csv('http://bit.ly/imdbratings')
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [200]:
movies.shape

(979, 6)

### Number of rows that match a condition

In [201]:
(movies['content_rating'] == 'PG').sum()

123

In [202]:
(movies['duration'] <90).sum()

72

### Percentage of rows that match a condition

In [203]:
(movies['content_rating'] == 'PG').mean()

0.12563840653728295

In [204]:
(movies['duration'] < 90).mean()

0.0735444330949949

### Do you need to build a DataFrame from multiple files, but also keep track of which row came from which file?
1. List files with glob() & sort the results
2. Read files with generator expression, create "filename" column with assign(), & combine DataFrames with concat()

In [205]:
from glob import glob

### Use glob() to list all filed that match a pattern & sort the results

In [206]:
stock_files = sorted(glob('stocks*.csv'))
stock_files

['stocks1.csv', 'stocks2.csv', 'stocks3.csv']

### Use a generator expression to read the files ,assign() to create a new column, & concat() to combine the dataFrames

In [207]:
pd.concat((pd.read_csv(file).assign(filename=file) for file in stock_files),ignore_index=True)

Unnamed: 0,Symbol,Date,Close,Volume,filename
0,AAPL,2016-10-03,112.52,21701800,stocks1.csv
1,MSFT,2016-10-03,57.42,19189500,stocks1.csv
2,CSCO,2016-10-03,31.5,14070500,stocks1.csv
3,CSCO,2016-10-04,31.35,18460400,stocks2.csv
4,AAPL,2016-10-04,113.0,29736800,stocks2.csv
5,MSFT,2016-10-04,57.24,20085900,stocks2.csv
6,CSCO,2016-10-05,31.59,11808600,stocks3.csv
7,AAPL,2016-10-05,113.05,21453100,stocks3.csv
8,MSFT,2016-10-05,57.64,16726400,stocks3.csv


### Want to combine the smaller categories in a Series into a single category called "Other"?
1. Save the index of the largest values of value_counts()
2. Use where() to replace all other values in the Series with "Other"

In [208]:
df12 = df

In [209]:
df12

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
...,...,...,...,...,...,...
977,7.4,Poltergeist,PG,Horror,114,"[u'JoBeth Williams', u""Heather O'Rourke"", u'Cr..."
978,7.4,Wall Street,R,Crime,126,"[u'Charlie Sheen', u'Michael Douglas', u'Tamar..."


In [210]:
df12['genre'].value_counts()

Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Adventure     75
Animation     62
Horror        29
Mystery       16
Western        9
Thriller       5
Sci-Fi         5
Film-Noir      3
Family         2
Fantasy        1
History        1
Name: genre, dtype: int64

In [211]:
top_four = df12['genre'].value_counts().nlargest(4).index
top_four

Index(['Drama', 'Comedy', 'Action', 'Crime'], dtype='object')

In [212]:
#Use where() to replace all other values in the Series with "Other"
genre_updated = df12['genre'].where(df12['genre'].isin(top_four),other='Other')

In [213]:
genre_updated.value_counts()

Other     285
Drama     278
Comedy    156
Action    136
Crime     124
Name: genre, dtype: int64

### Want to filter a DataFrame to only include the largest categories?
1. Save the value_counts() output
2. Get the index of its head()
3. Use that index with isin() to filter the DataFrame

In [214]:
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [215]:
counts = movies['genre'].value_counts()
counts

Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Adventure     75
Animation     62
Horror        29
Mystery       16
Western        9
Thriller       5
Sci-Fi         5
Film-Noir      3
Family         2
Fantasy        1
History        1
Name: genre, dtype: int64

In [216]:
largest_categories = counts.head(3).index
largest_categories

Index(['Drama', 'Comedy', 'Action'], dtype='object')

In [217]:
movies[movies['genre'].isin(largest_categories)].head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
5,8.9,12 Angry Men,NOT RATED,Drama,96,"[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals..."
9,8.9,Fight Club,R,Drama,139,"[u'Brad Pitt', u'Edward Norton', u'Helena Bonh..."
11,8.8,Inception,PG-13,Action,148,"[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'..."
12,8.8,Star Wars: Episode V - The Empire Strikes Back,PG,Action,124,"[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi..."


### Need to count the number of words in a Series? Just use a string method to count the spaces and add 1!

In [218]:
df13 = pd.DataFrame({'message':['pandas is awesome!','i love panda bears','pandas!!!']})
df13

Unnamed: 0,message
0,pandas is awesome!
1,i love panda bears
2,pandas!!!


In [219]:
df13['word_count'] = df13['message'].str.count(' ') + 1

In [220]:
df13

Unnamed: 0,message,word_count
0,pandas is awesome!,3
1,i love panda bears,4
2,pandas!!!,1


### Are you scraping a webpage using read_html(), but it returns too many tables? 😕
#### Use the 'match' parameter to find tables that contain a particular string!

In [221]:
url = 'http://en.wikipedia.org/wiki/Twitter'

In [222]:
tables = pd.read_html(url)
len(tables)

15

In [223]:
matching_tables = pd.read_html(url,match='Followers')
len(matching_tables)

1

In [224]:
matching_tables[0]

Unnamed: 0,Rank,Change (monthly),Account name,Owner,Followers (millions),Activity,Country
0,1,,@BarackObama,Barack Obama,113,Former U.S. president,USA
1,2,,@justinbieber,Justin Bieber,109,Musician,CAN
2,3,,@katyperry,Katy Perry,108,Musician,USA
3,4,,@rihanna,Rihanna,96,Musician and businesswoman,BAR
4,5,,@taylorswift13,Taylor Swift,86,Musician,USA
5,6,,@Cristiano,Cristiano Ronaldo,83,Footballer,POR
6,7,,@ladygaga,Lady Gaga,81,Musician and actress,USA
7,8,,@TheEllenShow,Ellen DeGeneres,79,Comedian,USA
8,9,,@realDonaldTrump,Donald Trump,72,Current U.S. president,USA
9,10,,@YouTube,YouTube,72,Online video platform,USA


### Need to remove a column from a DataFrame and store it as a separate Series? Use "pop"! 🍾

In [225]:
iris = pd.read_csv('iris.csv')

In [226]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [227]:
label = iris.pop('species')
label

0         setosa
1         setosa
         ...    
148    virginica
149    virginica
Name: species, Length: 150, dtype: object

In [228]:
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
...,...,...,...,...
148,6.2,3.4,5.4,2.3
149,5.9,3.0,5.1,1.8


### Need to convert a column from continuous to categorical?
➡️ Use cut() to specify bin edges
➡️ Use qcut() to specify number of bins (creates bins of approximately equal size)
➡️ Both allow you to label the bins

In [229]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [230]:
titanic['Age'].head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

In [231]:
## Cut() :you specific bin edges

pd.cut(titanic['Age'],bins=[0,18,25,99]).head()

0    (18, 25]
1    (25, 99]
2    (25, 99]
3    (25, 99]
4    (25, 99]
Name: Age, dtype: category
Categories (3, interval[int64]): [(0, 18] < (18, 25] < (25, 99]]

In [232]:
###qcut():You specify number of bins
pd.qcut(titanic['Age'],q=3).head()

0    (0.419, 23.0]
1     (34.0, 80.0]
2     (23.0, 34.0]
3     (34.0, 80.0]
4     (34.0, 80.0]
Name: Age, dtype: category
Categories (3, interval[float64]): [(0.419, 23.0] < (23.0, 34.0] < (34.0, 80.0]]

In [233]:
### cut() and qcut() both allow bin labels
pd.qcut(titanic['Age'],q=3,labels=['child','young adult','adult']).head()


0          child
1          adult
2    young adult
3          adult
4          adult
Name: Age, dtype: category
Categories (3, object): [child < young adult < adult]

### Want to extract tables from a PDF into a DataFrame? Try tabula-py!


In [234]:
#from tabula import read_pdf
# df =read_pdf('test.pdf',pages='all')

### Need to know which version of pandas you're using?
➡️ pd.__version__
Need to know the versions of its dependencies (numpy, matplotlib, etc)?
➡️ pd.show_versions()

In [235]:
pd.__version__

'0.25.1'

In [237]:
pd.show_versions()


INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.4.final.0
python-bits      : 64
OS               : Windows
OS-release       : 8.1
machine          : AMD64
processor        : AMD64 Family 22 Model 48 Stepping 1, AuthenticAMD
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 0.25.1
numpy            : 1.16.5
pytz             : 2019.3
dateutil         : 2.8.0
pip              : 19.2.3
setuptools       : 41.4.0
Cython           : 0.29.13
pytest           : 5.2.1
hypothesis       : None
sphinx           : 2.2.0
blosc            : None
feather          : None
xlsxwriter       : 1.2.1
lxml.etree       : 4.4.1
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.8.0
pandas_datareader: None
bs4              : 4.8.0
bottleneck       : 1.2.1
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.4.1
matplotli

### Need to check if two Series are "similar"? Use this: pd.testing.assert_series_equal(df.A, df.B, ...)
Useful arguments include:
➡️ check_names=False
➡️ check_dtype=False
➡️ check_exact=False

In [240]:
df14 = pd.DataFrame({'A':[1,2,3],'B':[1.0,2.0,3.0],'C':[1.000000,2.000000,3.000005]})
df14


Unnamed: 0,A,B,C
0,1,1.0,1.0
1,2,2.0,2.0
2,3,3.0,3.000005


###  equals() method requires identical datatypes:

In [241]:
df14['A'].equals(df14['B'])

False

### Assertion passes since we aren't checking datatypes:

In [244]:
pd.testing.assert_series_equal(df14['A'],df14['B'],check_names=False,check_dtype=False)

### Assertion passes the numbers are very similar:

In [246]:
pd.testing.assert_series_equal(df14['B'],df14['C'],check_names=False,check_exact=False)

### Want to scrape a web page? Try read_html()!
#### Definitely worth trying before bringing out a more complex tool (Beautiful Soup, Selenium, e

In [247]:
apple_stock =pd.read_html('https://finance.yahoo.com/quote/AAPL?p-AAL')

In [248]:
pd.concat([apple_stock[0],apple_stock[1]])

Unnamed: 0,0,1
0,Previous Close,323.62
1,Open,322.63
2,Bid,0.00 x 800
3,Ask,0.00 x 1100
4,Day's Range,318.21 - 324.64
5,52 Week Range,169.50 - 327.85
6,Volume,25141489
7,Avg. Volume,29928195
0,Market Cap,1.401T
1,Beta (5Y Monthly),1.28


### Want to create new columns (or overwrite existing columns) within a method chain? Use "assign"!

In [249]:
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In [254]:
df15 =drinks.set_index('country')
df15

Unnamed: 0_level_0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Afghanistan,0,0,0,0.0,Asia
Albania,89,132,54,4.9,Europe
...,...,...,...,...,...
Zambia,32,19,4,2.5,Africa
Zimbabwe,64,18,4,4.7,Africa


In [256]:
(df15.assign(continent =df15['continent'].str.title(),
             beer_ounces = df15['beer_servings'] * 12,
             beer_gallons = lambda df15:df15['beer_ounces']/128)
     .query('beer_gallons>30')
     .style.set_caption('Average beer conmsumption per person in 2010'))

Unnamed: 0_level_0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent,beer_ounces,beer_gallons
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Czech Republic,361,170,134,11.8,Europe,4332,33.8438
Gabon,347,98,59,8.9,Africa,4164,32.5312
Germany,346,117,175,11.3,Europe,4152,32.4375
Lithuania,343,244,56,12.9,Europe,4116,32.1562
Namibia,376,3,1,6.8,Africa,4512,35.25
Poland,343,215,56,10.9,Europe,4116,32.1562
Venezuela,333,100,3,7.7,South America,3996,31.2188


### Need to create a bunch of new columns based on existing columns? Use this pattern:
for col in df.columns:
    df[f'{col}_new'] = df[col].apply(my_function)

In [257]:
df16 = pd.DataFrame({'state':['ny','CA','Tx','FI'],'country':['usa','USA','usa','USA']})

In [258]:
df16

Unnamed: 0,state,country
0,ny,usa
1,CA,USA
2,Tx,usa
3,FI,USA


In [259]:
for col in df16.columns:
    df16[f'{col}_fixed'] = df16[col].str.upper()

In [260]:
df16

Unnamed: 0,state,country,state_fixed,country_fixed
0,ny,usa,NY,USA
1,CA,USA,CA,USA
2,Tx,usa,TX,USA
3,FI,USA,FI,USA


### You can use f-strings (Python 3.6+) when selecting a Series from a DataFrame!


In [261]:
df17 = drinks

In [262]:
df17.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In [263]:
drink ='wine'

In [264]:
df17[f'{drink}_servings']

0       0
1      54
       ..
191     4
192     4
Name: wine_servings, Length: 193, dtype: int64

### Did you get a "SettingWithCopyWarning" when creating a new column? You are probably assigning to a DataFrame that was created from another DataFrame.


In [265]:
df18 = pd.DataFrame({'gender':['Male','Female','Male','Female']})
df18

Unnamed: 0,gender
0,Male
1,Female
2,Male
3,Female


In [271]:
males = df18[df18['gender'] == 'Male'] 
males

Unnamed: 0,gender
0,Male
2,Male


In [270]:
males['abbervation'] ='M'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


### Solution: Use the "copy" method when copying a DataFrame!

In [272]:
males = df18[df18['gender'] == 'Male'].copy()

In [273]:
males['abbervation'] ='M'

In [274]:
males

Unnamed: 0,gender,abbervation
0,Male,M
2,Male,M


### Want to read a JSON file from the web? Use read_json() to read it directly from a URL into a DataFrame! 

In [275]:
df19 = pd.read_json('https://api.github.com/users/justmarkham/repos?per+page=100')
df19

Unnamed: 0,id,node_id,name,full_name,private,owner,html_url,description,fork,url,...,forks_count,mirror_url,archived,disabled,open_issues_count,license,forks,open_issues,watchers,default_branch
0,62405159,MDEwOlJlcG9zaXRvcnk2MjQwNTE1OQ==,awesome-datascience,justmarkham/awesome-datascience,False,"{'login': 'justmarkham', 'id': 6509492, 'node_...",https://github.com/justmarkham/awesome-datasci...,:memo: An awesome Data Science repository to l...,True,https://api.github.com/repos/justmarkham/aweso...,...,56,,False,False,0,"{'key': 'mit', 'name': 'MIT License', 'spdx_id...",56,0,29,master
1,62405816,MDEwOlJlcG9zaXRvcnk2MjQwNTgxNg==,awesome-machine-learning,justmarkham/awesome-machine-learning,False,"{'login': 'justmarkham', 'id': 6509492, 'node_...",https://github.com/justmarkham/awesome-machine...,A curated list of awesome Machine Learning fra...,True,https://api.github.com/repos/justmarkham/aweso...,...,45,,False,False,0,"{'key': 'other', 'name': 'Other', 'spdx_id': '...",45,0,18,master
2,24103978,MDEwOlJlcG9zaXRvcnkyNDEwMzk3OA==,babynames,justmarkham/babynames,False,"{'login': 'justmarkham', 'id': 6509492, 'node_...",https://github.com/justmarkham/babynames,Baby Names by Birth Year,False,https://api.github.com/repos/justmarkham/babyn...,...,5,,False,False,0,,5,0,3,master
3,134311703,MDEwOlJlcG9zaXRvcnkxMzQzMTE3MDM=,challenges,justmarkham/challenges,False,"{'login': 'justmarkham', 'id': 6509492, 'node_...",https://github.com/justmarkham/challenges,PyBites Code Challenges,True,https://api.github.com/repos/justmarkham/chall...,...,8,,False,False,0,,8,0,2,master
4,18948892,MDEwOlJlcG9zaXRvcnkxODk0ODg5Mg==,coursera-getting-data,justmarkham/coursera-getting-data,False,"{'login': 'justmarkham', 'id': 6509492, 'node_...",https://github.com/justmarkham/coursera-gettin...,"Class project for Coursera's ""Getting and Clea...",False,https://api.github.com/repos/justmarkham/cours...,...,59,,False,False,0,,59,0,6,master
5,17607592,MDEwOlJlcG9zaXRvcnkxNzYwNzU5Mg==,crisp-ghost-theme,justmarkham/crisp-ghost-theme,False,"{'login': 'justmarkham', 'id': 6509492, 'node_...",https://github.com/justmarkham/crisp-ghost-theme,"A minimalist, responsive, and open-source them...",True,https://api.github.com/repos/justmarkham/crisp...,...,1,,False,False,0,,1,0,3,master
6,24203980,MDEwOlJlcG9zaXRvcnkyNDIwMzk4MA==,DAT3,justmarkham/DAT3,False,"{'login': 'justmarkham', 'id': 6509492, 'node_...",https://github.com/justmarkham/DAT3,General Assembly's Data Science course in Wash...,False,https://api.github.com/repos/justmarkham/DAT3,...,353,,False,False,0,,353,0,563,master
7,27836310,MDEwOlJlcG9zaXRvcnkyNzgzNjMxMA==,DAT4,justmarkham/DAT4,False,"{'login': 'justmarkham', 'id': 6509492, 'node_...",https://github.com/justmarkham/DAT4,General Assembly's Data Science course in Wash...,False,https://api.github.com/repos/justmarkham/DAT4,...,615,,False,False,1,,615,1,713,master
8,31725376,MDEwOlJlcG9zaXRvcnkzMTcyNTM3Ng==,DAT5,justmarkham/DAT5,False,"{'login': 'justmarkham', 'id': 6509492, 'node_...",https://github.com/justmarkham/DAT5,General Assembly's Data Science course in Wash...,False,https://api.github.com/repos/justmarkham/DAT5,...,179,,False,False,0,,179,0,155,master
9,35706622,MDEwOlJlcG9zaXRvcnkzNTcwNjYyMg==,DAT7,justmarkham/DAT7,False,"{'login': 'justmarkham', 'id': 6509492, 'node_...",https://github.com/justmarkham/DAT7,General Assembly's Data Science course in Wash...,False,https://api.github.com/repos/justmarkham/DAT7,...,189,,False,False,0,,189,0,218,master


In [280]:
df19.columns

Index(['id', 'node_id', 'name', 'full_name', 'private', 'owner', 'html_url',
       'description', 'fork', 'url', 'forks_url', 'keys_url',
       'collaborators_url', 'teams_url', 'hooks_url', 'issue_events_url',
       'events_url', 'assignees_url', 'branches_url', 'tags_url', 'blobs_url',
       'git_tags_url', 'git_refs_url', 'trees_url', 'statuses_url',
       'languages_url', 'stargazers_url', 'contributors_url',
       'subscribers_url', 'subscription_url', 'commits_url', 'git_commits_url',
       'comments_url', 'issue_comment_url', 'contents_url', 'compare_url',
       'merges_url', 'archive_url', 'downloads_url', 'issues_url', 'pulls_url',
       'milestones_url', 'notifications_url', 'labels_url', 'releases_url',
       'deployments_url', 'created_at', 'updated_at', 'pushed_at', 'git_url',
       'ssh_url', 'clone_url', 'svn_url', 'homepage', 'size',
       'stargazers_count', 'watchers_count', 'language', 'has_issues',
       'has_projects', 'has_downloads', 'has_wiki', 'has

In [281]:
df19 =df19[df19['fork'] == False]
df19.shape

(18, 73)

In [285]:
cols = ['name','stargazers_count','forks_count']
df19[cols].sort_values('stargazers_count',ascending=False).head(10)

Unnamed: 0,name,stargazers_count,forks_count
10,DAT8,1413,803
22,pandas-videos,997,960
7,DAT4,713,615
6,DAT3,563,353
24,pycon-2016-tutorial,363,359
25,pycon-2018-tutorial,249,181
26,pycon-2019-tutorial,228,130
9,DAT7,218,189
8,DAT5,155,179
15,dplyr-tutorial,128,214


### Need to calculate a running total (or "cumulative sum")? Use the cumsum() function! Also works with groupby()

In [286]:
df20 = pd.DataFrame({'salesperson':['Alice','Bob','Alice','Bob','Alice','Charlie'],
                    'sales':[100,50,120,70,100,30]})

In [287]:
df20

Unnamed: 0,salesperson,sales
0,Alice,100
1,Bob,50
2,Alice,120
3,Bob,70
4,Alice,100
5,Charlie,30


In [292]:
df20['running_total'] = df20['sales'].cumsum()
df20['total_running_by_person'] = df20.groupby('salesperson').sales.cumsum()
df20

Unnamed: 0,salesperson,sales,running_total,total_running_by_person
0,Alice,100,100,100
1,Bob,50,150,50
2,Alice,120,270,220
3,Bob,70,340,120
4,Alice,100,440,320
5,Charlie,30,470,30


### Need to calculate a running count within groups? Do this:
df.groupby('col').cumcount() + 1

In [293]:
df21 = pd.DataFrame({'salesperson':['Alice','Alice','Bob','Alice','Charlie','Bob','Charlie'],
                    'item':['car','truck','car','car','car','car','truck']})

df21

Unnamed: 0,salesperson,item
0,Alice,car
1,Alice,truck
2,Bob,car
3,Alice,car
4,Charlie,car
5,Bob,car
6,Charlie,truck


In [294]:
df21['count_by_person'] = df21.groupby('salesperson').cumcount() + 1
df21['count_by_item'] = df21.groupby('item').cumcount() + 1
df21['count_by_both'] =df21.groupby(['salesperson','item']).cumcount() + 1

In [295]:
df21

Unnamed: 0,salesperson,item,count_by_person,count_by_item,count_by_both
0,Alice,car,1,1,1
1,Alice,truck,2,1,1
2,Bob,car,1,2,1
3,Alice,car,3,3,2
4,Charlie,car,1,4,1
5,Bob,car,2,5,2
6,Charlie,truck,2,2,1
