# Instructor-led Lab: Advanced Data Manipulation 📊

In this lab, we will practice new data manipulation skills using *piping* expressions, where several methods are chained together one after another! <br>

## Sorting and Ordering Data <br>

We will go over sorting, ordering, and filtering data with some more advanced techniques and piping expressions! <br>
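As a quick illustration of what a *piping* expression looks like, here is a minimal sketch on a small, made-up DataFrame. The `teams` name and its values are assumptions for this example, not the lab data. Wrapping the whole expression in parentheses lets each method sit on its own line, and each step receives the result of the step before it.

```python
import pandas as pd

# Hypothetical toy data: the column names mirror the lab's dataset,
# but the values are made up for illustration.
teams = pd.DataFrame({
    "Team_type": ["human", "human-bot", "human", "human-bot"],
    "human_work": [62, 66, 211, 1625],
})

# Step-by-step version: every intermediate result gets its own name.
subset = teams.filter(["Team_type", "human_work"])
sorted_subset = subset.sort_values(by="human_work", ascending=False)

# Piped version: the same two steps, one per line, inside parentheses.
chained = (
    teams
    .filter(["Team_type", "human_work"])
    .sort_values(by="human_work", ascending=False)
)

print(chained)
```

Both versions should give the same result. The piped form simply avoids the temporary variables.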

Let's begin by importing the modules:

In [1]:
import os  #Imports the "os" module (used to check the working directory)
import numpy as np  #Imports the "numpy" package
import pandas as pd  #Imports the "pandas" package

Now let's make sure that we are in the correct working directory:

In [2]:

#Prints the current working directory

os.getcwd()  

'/home/school/Documents/python/week_8'

Sweet! Looks like we are already in the correct working directory! <br>
Now let's load in the data for the lab. We will be using the "github_teams" data provided:

In [3]:

#Reads in a .csv file

github_teams = pd.read_csv("data/github_teams.csv")

#Quick Inspect

github_teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   name_h                    608 non-null    object 
 1   Team_type                 608 non-null    object 
 2   Team_size_class           608 non-null    object 
 3   human_members_count       608 non-null    int64  
 4   bot_members_count         608 non-null    int64  
 5   human_work                608 non-null    int64  
 6   work_per_human            608 non-null    float64
 7   human_gini                608 non-null    float64
 8   human_Push                608 non-null    int64  
 9   human_IssueComments       608 non-null    int64  
 10  human_PRReviewComment     608 non-null    int64  
 11  human_MergedPR            608 non-null    int64  
 12  bot_work                  608 non-null    int64  
 13  bot_Push                  608 non-null    int64  
 14  bot_IssueC

Awesome! Now we can see that the "github_teams" dataset has successfully been read into Python! <br>
Now let's start working with the data:

In [4]:

#Selects/Filters the columns with the respective names

(github_teams
.filter(["Team_type","human_work","work_per_human"]))

     Team_type  human_work  work_per_human
0    human-bot          66       33.000000
1        human          62       31.000000
2        human         211       30.142857
3    human-bot       14579       62.303419
4    human-bot        1625       42.763158
...        ...         ...             ...
603  human-bot         855      285.000000
604      human          63       31.500000
605      human          26        5.200000
606      human          13        2.600000


In [5]:

# Select columns that end in the letter "t"

(github_teams
.filter(regex="t$"))

     human_members_count  bot_members_count  human_PRReviewComment  bot_PRReviewComment  issues_count
0                      2                  1                      4                    0           8.0
1                      2                  0                      0                    0           NaN
2                      7                  0                      1                    0          46.0
3                    234                 12                   1170                    0        4757.0
4                     38                  8                    152                    0         777.0
...                  ...                ...                    ...                  ...           ...
603                    3                  1                    373                    0           NaN
604                    2                  0                      0                    0           NaN
605                    5                  0                      2                    0           NaN
606                    5                  0                      1                    0          10.0
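The two cells above use `.filter()` in two different ways. As a rough sketch, the example below puts the three selection modes side by side on a made-up DataFrame (`toy` and its values are assumptions for illustration). Note that `.filter()` matches on column *labels*. It does not look at the values inside the rows.

```python
import pandas as pd

# Hypothetical toy data with column names borrowed from the lab's dataset.
toy = pd.DataFrame({
    "human_work": [66, 62],
    "bot_work": [43, 0],
    "human_members_count": [2, 2],
    "issues_count": [8.0, None],
})

# items=: keep exactly the listed column names.
by_items = toy.filter(items=["human_work", "bot_work"])

# like=: keep columns whose name contains the given substring.
by_like = toy.filter(like="human")

# regex=: keep columns whose name matches the pattern ("t$" = ends in "t").
by_regex = toy.filter(regex="t$")

print(list(by_items.columns))
print(list(by_like.columns))
print(list(by_regex.columns))
```

Passing a plain list as the first argument, as in cell 4, is the same as using `items=`.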


In [6]:

#Sorts the data in descending order using the "Team_size_class" column
#The column holds text, so the order is alphabetical (Small, Medium, Large), not by team size

(github_teams
.sort_values(by="Team_size_class", ascending=False))

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
0,_1bqaxzCk0sfQaunsjeViQ/RCEZ3CASdLXbstu9y2JQ7Q,human-bot,Small,2,1,66,33.000000,0.287879,29,33,4,0,43,0,43,0,0,87.0,8.0
274,I1zTLQ0sX0_e1IKG0DL4FQ/1obWttaaw8DkHzSckiVWpw,human,Small,2,0,22,11.000000,0.181818,22,0,0,0,0,0,0,0,0,,
326,kRXHmTG1hg25Ms4CoQXuPA/ZDjGaeDQ3pjmkpwswPrEdA,human,Small,2,0,29,14.500000,0.258621,29,0,0,0,0,0,0,0,0,,
323,kozDEuBo0KKt8iyeAtO7pw/Khx_dnicluL6V8QNDEKmVQ,human,Small,3,0,103,34.333333,0.362460,101,2,0,0,0,0,0,0,0,,
320,kcU_qvEH_y-BumXu1QpNnQ/FL4dW60P1hcLJtWhNFzrkw,human,Small,2,0,28,14.000000,0.000000,28,0,0,0,0,0,0,0,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
439,ReeJIVTm1BiV_SO5tNDVFA/0DyNJHtC341DeboKhr6ZAw,human-bot,Large,10,1,693,69.300000,0.578211,97,559,37,0,1000,997,3,0,0,91.0,130.0
441,ReeJIVTm1BiV_SO5tNDVFA/ZfIj2C4Weg8to684MM7PGA,human-bot,Large,15,2,2211,147.400000,0.544068,214,1312,685,0,50,0,47,3,0,9.0,6.0
448,ROWwIEddQryJzc7vl7nTWQ/_GEQzVeMWyYnaL12CEJ2-g,human,Large,10,0,186,18.600000,0.553763,149,36,1,0,0,0,0,0,0,64.0,17.0
455,sdJYSIlyI8h0ITiqLCLVvw/m4OHr6Thm04-BYsBZmXGkQ,human-bot,Large,7,1,590,84.285714,0.511864,240,82,268,0,14,0,14,0,0,7.0,7.0
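Because "Team_size_class" holds plain text, the sort above is alphabetical. If we wanted the rows ordered by actual team size instead, one possible approach is an ordered categorical. This is only a sketch on made-up data: the `toy` frame is hypothetical, and the Small < Medium < Large ordering is our assumption about what the labels mean.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "Team_size_class": ["Small", "Large", "Medium", "Large"],
})

# Plain text sort: descending alphabetical order puts "Small" first.
alphabetical = toy.sort_values(by="Team_size_class", ascending=False)

# Assumed size ordering: Small < Medium < Large.
size_order = pd.CategoricalDtype(["Small", "Medium", "Large"], ordered=True)

# Convert the column to the ordered type, then sort from largest to smallest.
by_size = (
    toy
    .assign(Team_size_class=toy["Team_size_class"].astype(size_order))
    .sort_values(by="Team_size_class", ascending=False)
)

print(alphabetical["Team_size_class"].tolist())
print(by_size["Team_size_class"].tolist())
```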


In [7]:
#Sorts the data in descending order using the "human_work" column

(github_teams
.sort_values(by="human_work", ascending=False))

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
230,FAhkB4rsocfDW0vrM8U8NA/x48tfDezFe8qzB_Gbe_tMA,human-bot,Large,54,1,15448,286.074074,0.625010,2056,7944,5382,66,282,0,282,0,0,9.0,1015.0
3,_l5u7I5p4thtW5SjR_9_4w/aZNCdVXta7fh7eCMzZP1CA,human-bot,Large,234,12,14579,62.303419,0.738342,1942,11430,1170,37,1972,0,1972,0,0,1.0,4757.0
534,vjYCi8YxMpUj_LaqRdiCXw/wPCNTC9mtdWb8MKJVr439g,human,Large,28,0,10705,382.321429,0.743961,4230,2337,3997,141,0,0,0,0,0,4.0,21.0
37,0YuGJKD19yHae4I2X8Vi3Q/gWpl7HC8bRe4tH8S7ASf9w,human-bot,Large,84,1,9644,114.809524,0.807448,1095,5073,3365,111,329,0,329,0,0,2.0,2993.0
89,5Is-_ie16OEGmW1arZm8qg/8UeSk2P76pTG7pPLtxsHTQ,human-bot,Large,17,4,7412,436.000000,0.439621,4182,1257,1917,56,358,5,202,151,0,2.0,495.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180,CWghQeCv2EVQNS-Wb6M_EA/N0nGHZBEl2y-GPcfOGvytQ,human,Small,2,0,9,4.500000,0.166667,7,2,0,0,0,0,0,0,0,,
25,0Jb-T8Py7O9d0fsJhO-Fpw/9CTgx1gdJFtNJTvxhIZZUA,human-bot,Small,2,1,8,4.000000,0.125000,6,1,1,0,13,13,0,0,0,,
237,FJmB0zbVT0ileOMUPtWRIQ/v2hyhTxDNjcQKAdrTbpb-g,human-bot,Small,3,1,8,2.666667,0.416667,8,0,0,0,24,24,0,0,0,,
60,3VFbLRx-am2PA7KH0P_qQQ/80s2Rw6RQNhfx0QKHYY0Og,human-bot,Small,2,1,7,3.500000,0.071429,6,1,0,0,5,3,2,0,0,,


In [8]:
#Sorts the data in descending order using the "work_per_human" column

(github_teams
.sort_values(by="work_per_human", ascending=False))

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
559,WLaEz_1Nf-YWzHZa8bBgAA/pLoAhZ1cbPT38VYoSdXGmg,human,Small,3,0,3040,1013.333333,0.292105,434,2606,0,0,0,0,0,0,0,3.0,365.0
209,eIPosZ68E2LjtaixYK65EQ/0Rp6D1ZR1w4YspfD1H-PfA,human-bot,Small,3,1,1639,546.333333,0.413260,1114,283,223,19,156,0,156,0,0,21.0,204.0
268,hXoZRbHPbVxh--funPXSiw/iNU0l_SpKVjGfHOp8vUt8w,human-bot,Small,2,2,910,455.000000,0.065934,207,369,334,0,114,0,114,0,0,4.0,64.0
89,5Is-_ie16OEGmW1arZm8qg/8UeSk2P76pTG7pPLtxsHTQ,human-bot,Large,17,4,7412,436.000000,0.439621,4182,1257,1917,56,358,5,202,151,0,2.0,495.0
88,572IwNWvcAWEW0XmL9kzxg/dsdz8N9WSpmtzc0q9veAhQ,human-bot,Large,9,1,3774,419.333333,0.504151,2733,958,5,78,575,0,575,0,0,27.0,1210.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
497,tzOdbObpFg-vSbow3RBUBg/siHdJSIMcipwQwok068VBg,human,Large,115,0,286,2.486957,0.322043,286,0,0,0,0,0,0,0,0,,
600,ZPbZUFullnQ00GjBX1RaUQ/fR24eYhOcMGsDUKZIYFXeg,human,Medium,6,0,14,2.333333,0.190476,13,1,0,0,0,0,0,0,0,,
484,TcHjRpnzF2qIwAEHHGAFAA/siHdJSIMcipwQwok068VBg,human,Large,68,0,155,2.279412,0.345446,155,0,0,0,0,0,0,0,0,,
292,ilEEnZRNdUqNk7tvvGBo1Q/8xHANrgyHiFV3TuhcqwrpQ,human,Large,8,0,14,1.750000,0.250000,14,0,0,0,0,0,0,0,0,,
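The sorts above each use a single column. As a small sketch on made-up data (the `toy` frame is an assumption for illustration), `sort_values` can also take a list of columns with one direction per column, and `nlargest` is a shortcut when only the top few rows are needed.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "Team_type": ["human", "human-bot", "human", "human-bot"],
    "work_per_human": [31.0, 33.0, 5.2, 285.0],
})

# Sort by team type (A to Z), then by work_per_human (high to low) within each type.
multi = toy.sort_values(
    by=["Team_type", "work_per_human"],
    ascending=[True, False],
)

# Keep only the two rows with the highest work_per_human.
top2 = toy.nlargest(2, "work_per_human")

print(multi)
print(top2)
```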


In [9]:

#Selects the "human-bot" teams that have a "bot_members_count" greater than or equal to 3

(github_teams
.query("Team_type == 'human-bot' & bot_members_count >= 3"))

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
3,_l5u7I5p4thtW5SjR_9_4w/aZNCdVXta7fh7eCMzZP1CA,human-bot,Large,234,12,14579,62.303419,0.738342,1942,11430,1170,37,1972,0,1972,0,0,1.0,4757.0
4,_l5u7I5p4thtW5SjR_9_4w/m_FpD7PKQHqVXHn2bh7u2g,human-bot,Large,38,8,1625,42.763158,0.666607,203,1270,152,0,302,0,302,0,0,2.0,777.0
89,5Is-_ie16OEGmW1arZm8qg/8UeSk2P76pTG7pPLtxsHTQ,human-bot,Large,17,4,7412,436.0,0.439621,4182,1257,1917,56,358,5,202,151,0,2.0,495.0


In [10]:

#Selects the "human" teams that are large and have a 'human_gini' value greater than or equal to 0.75

(github_teams
.query("Team_type == 'human' & Team_size_class == 'Large' & human_gini >= 0.75"))



Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
138,ASYGR96YA91p3z7MNKjZCA/IB2pZ8ygcvNnlxUdysjSFA,human,Large,12,0,1655,137.916667,0.799446,793,684,178,0,0,0,0,0,0,4.0,190.0
285,IiUao8vA_zm_uEIVVLI-Sw/91ya8vlSP8qgwCllH_6BSw,human,Large,25,0,3599,143.96,0.863507,1249,2350,0,0,0,0,0,0,0,0.0,1245.0
505,uLHPO58cQefwrJUbyhYOKQ/7YWOP8uDEeKDHQMWKqOoYA,human,Large,48,0,5748,119.75,0.78204,1715,3891,142,0,0,0,0,0,0,0.0,1200.0
582,y8Jw59EHVSrsluSuhR5okg/V5vb074jNkzg4YCKforX1Q,human,Large,8,0,277,34.625,0.781137,275,2,0,0,0,0,0,0,0,,
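To make it clearer what `.query()` is doing, here is a hedged sketch that compares it with an ordinary boolean mask on made-up data (`toy` and `min_bots` are hypothetical names). Inside a query string, `and` can be used in place of `&`, and a Python variable can be referenced with the `@` prefix. With a plain mask, each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "Team_type": ["human", "human-bot", "human-bot", "human"],
    "bot_members_count": [0, 1, 4, 0],
    "human_gini": [0.80, 0.29, 0.44, 0.10],
})

# Threshold kept in a variable so it is easy to change.
min_bots = 3

# Query string: column names are written bare, variables use "@".
with_query = toy.query("Team_type == 'human-bot' and bot_members_count >= @min_bots")

# Equivalent boolean mask: parentheses around each condition are required.
with_mask = toy[(toy["Team_type"] == "human-bot") & (toy["bot_members_count"] >= min_bots)]

print(with_query)
```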


In [11]:

#Determines how many teams are in the 'Small' or 'Large' category

(github_teams
.query("Team_size_class != 'Medium'")
.agg({'name_h':("count")})
)



name_h    428
dtype: int64

There are 428 teams that are either 'Small' or 'Large'.

In [12]:

#Determines how many teams are in the 'Small' or 'Large' category with a 'human_gini' 
#value less than or equal to 0.25.


(github_teams
.query("Team_size_class != 'Medium' & human_gini <= 0.25")
.agg({'name_h':("count")})
)


name_h    89
dtype: int64

There are 89 teams that are 'Small' or 'Large' and have a 'human_gini' value less than or equal to 0.25.

In [13]:

#Counts how many 'human' teams are in the 'Medium' category

(github_teams
.query("Team_type == 'human' & Team_size_class == 'Medium'")
.agg({'name_h':("count")})
)



name_h    96
dtype: int64

There are 96 'human' teams in the 'Medium' size class.
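One detail worth knowing about the counts above: `"count"` counts non-missing values in the chosen column, so it only equals the number of rows when that column has no missing values (as is the case for 'name_h' in our data). The sketch below, on made-up data (`toy` is hypothetical), shows a few possible ways to count and where they can differ.

```python
import pandas as pd

# Hypothetical toy data for illustration; issues_count has missing values.
toy = pd.DataFrame({
    "name_h": ["a", "b", "c", "d", "e"],
    "Team_size_class": ["Small", "Large", "Medium", "Small", "Large"],
    "issues_count": [8.0, None, 46.0, None, 10.0],
})

not_medium = toy.query("Team_size_class != 'Medium'")

# Same pattern as the lab: count the non-missing values of name_h.
n_agg = not_medium.agg({"name_h": "count"})

# Number of rows, regardless of missing values.
n_rows = len(not_medium)

# Counting a column with missing values gives a smaller number.
n_issues = not_medium["issues_count"].count()

# Count of rows for every size class at once.
per_class = toy["Team_size_class"].value_counts()

print(n_agg, n_rows, n_issues, per_class, sep="\n")
```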

In [14]:

#Saves the 'Team_size_class' and 'work_per_human' columns into a new dataframe

new_github_teams = (github_teams
.filter(["Team_size_class","work_per_human"]))

#Print to confirm

new_github_teams.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Team_size_class  608 non-null    object 
 1   work_per_human   608 non-null    float64
dtypes: float64(1), object(1)
memory usage: 9.6+ KB
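Selecting columns with `.filter()` is not the only option. As a rough comparison on made-up data (the `toy` frame and the `no_such_column` name are assumptions for illustration), double square brackets give the same result, but the two behave differently when a column name is misspelled.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "Team_size_class": ["Small", "Large"],
    "work_per_human": [33.0, 62.3],
    "human_work": [66, 14579],
})

# Two ways to keep the same two columns.
via_filter = toy.filter(["Team_size_class", "work_per_human"])
via_brackets = toy[["Team_size_class", "work_per_human"]]

# .filter() quietly skips names that do not exist...
lenient = toy.filter(["Team_size_class", "no_such_column"])

# ...while double brackets raise a KeyError for a missing name.
try:
    toy[["Team_size_class", "no_such_column"]]
    raised = False
except KeyError:
    raised = True

print(list(lenient.columns), raised)
```

The quiet behavior of `.filter()` is convenient in a pipe, but it also means a typo in a column name will not produce an error, so checking the result with `.info()` as we did above is a good habit.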


In [15]:

#Renames the columns in the dataframe:

# 'human_gini' -> 'work_inequality'
# 'eval_survival_day_median' -> 'issue_resolution_time'


github_teams_rename = (github_teams
.rename(columns={
    'human_gini':'work_inequality',
    'eval_survival_day_median':'issue_resolution_time'
}))

#Print to confirm

github_teams_rename.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   name_h                 608 non-null    object 
 1   Team_type              608 non-null    object 
 2   Team_size_class        608 non-null    object 
 3   human_members_count    608 non-null    int64  
 4   bot_members_count      608 non-null    int64  
 5   human_work             608 non-null    int64  
 6   work_per_human         608 non-null    float64
 7   work_inequality        608 non-null    float64
 8   human_Push             608 non-null    int64  
 9   human_IssueComments    608 non-null    int64  
 10  human_PRReviewComment  608 non-null    int64  
 11  human_MergedPR         608 non-null    int64  
 12  bot_work               608 non-null    int64  
 13  bot_Push               608 non-null    int64  
 14  bot_IssueComments      608 non-null    int64  
 15  bot_PR
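Two things about `.rename()` are easy to miss, so here is a small sketch on made-up data (`toy` is hypothetical). First, it returns a new DataFrame and leaves the original untouched, which is why we saved the result under a new name. Second, by default it ignores names it cannot find. Passing `errors="raise"` is one way to catch a misspelled column name.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "human_gini": [0.29, 0.18],
    "eval_survival_day_median": [87.0, None],
})

renamed = toy.rename(columns={
    "human_gini": "work_inequality",
    "eval_survival_day_median": "issue_resolution_time",
})

# A misspelled name ("human-gini") is silently ignored by default.
silent = toy.rename(columns={"human-gini": "work_inequality"})

# With errors="raise", the same typo produces a KeyError instead.
try:
    toy.rename(columns={"human-gini": "work_inequality"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(list(toy.columns))      # original names are unchanged
print(list(renamed.columns))  # new names
print(typo_caught)
```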

# References

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html