# Instructor-led Lab: Manipulating Data


  
# Instructions:

In this assignment you will practice utilizing DataFrames in a program. 

For this lab, you will use the [github_teams.csv](/data/github_team.csv) file, which contains behavioral trace data extracted from GitHub.

## Accessing Data

Using the `github_teams` dataset, please perform the following operations in order:

* Open the file within Python.
* Find out what the column header names are.
* Determine the number of columns.
* Determine the number of rows.
* Determine which columns are categorical and convert them from *object* to *category*.
* How many unique values does `Team_type` have?
* How many unique values does `Team_size_class` have?
* What is the value of the 63rd row and 6th column?
* What are the values for the 300th row?
* Using three different methods, select the row with index value 595 and its 1st, 2nd, and 3rd columns.
* Using two different methods, select the row with index value 46 and its 3rd and 7th columns.
* Create a new DataFrame for the column `bot_work` using two different methods.

## Sorting and Ordering data 

Now that you have learned to subsample data, it is your turn to apply your new knowledge. Using the `github_teams` dataset, please perform the following operations in order:

* Select `human-bot` teams that have a `bot_members_count` value greater than or equal to 2.
* Find the `human` teams that are `Large` and have a `human_gini` value greater than or equal to 0.75.
* How many teams are in the `Small` or `Large` category?
* How many teams are in the `Small` or `Large` category with a `human_gini` value less than or equal to 0.20?
* How many `human-bot` teams are in the `Medium` category?
* Create a subsample of 50% of your data.
* Create samples for an 8-fold cross-validation test.
* Select the columns that are numeric and save them as a new DataFrame.
* Remove the columns `bot_PRReviewComment` and `bot_MergedPR` from the DataFrame.
* Save the columns `Team_size_class` and `human_members_count` as a new DataFrame.
* Rename these two columns in the new DataFrame.

Save your notebook with output displayed within it and submit for grading.

### Import Modules

In [16]:

#Import the various modules into Jupyter Notebook so that they can be used later

import os
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

Great! We have the various modules imported into Jupyter Notebook! Now we need to read in the .csv file which contains the necessary data!  

### Accessing Data:

In [3]:
#First, let's go through and find our current working directory.
os.getcwd()

'/home/schoo/Documents/python/week_7'

Awesome! We can see that the current working directory is the appropriate one with all of the necessary files!  
Now, let's read in the data!  

There is only one .csv file that needs to be read into Jupyter Notebook: github_teams.csv. This file can be found in the "data/" directory within the "week_7/" directory. 

In [4]:

#Read in the provided file.

teams = pd.read_csv('data/github_teams.csv') 

Awesomesauce! We now have the .csv file in Jupyter Notebook! Let's begin by taking a quick look at the data and finding out what the header names are using the '.columns' attribute!  

### Inspect Columns

In [5]:

#Print each column name on its own line (list(teams.columns) would be more compact).

for column_name in teams.columns:
    print(column_name)

name_h
Team_type
Team_size_class
human_members_count
bot_members_count
human_work
work_per_human
human_gini
human_Push
human_IssueComments
human_PRReviewComment
human_MergedPR
bot_work
bot_Push
bot_IssueComments
bot_PRReviewComment
bot_MergedPR
eval_survival_day_median
issues_count


Wow! That is a lot of columns. Let's count how many there are!

In [6]:

#.shape returns the height and the width of a dataframe, the index [1] contains the width.

print(f"Number of Columns: {teams.shape[1]}") 

Number of Columns: 19


Now we know that there are 19 different columns within the teams dataframe. Why don't we now take a look at the height of the dataframe to figure out how many rows there are?

In [7]:

#.shape returns the height and the width of a dataframe, the index [0] contains the height.

print(f"Number of Rows: {teams.shape[0]}") 

Number of Rows: 608


There are 608 rows within the teams dataframe.  

Now that we have an idea of the size of the teams dataframe, why don't we take a closer look at the columns and what type they are? 

In [8]:

#Print a summary of every column: name, non-null count, and data type.

teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   name_h                    608 non-null    object 
 1   Team_type                 608 non-null    object 
 2   Team_size_class           608 non-null    object 
 3   human_members_count       608 non-null    int64  
 4   bot_members_count         608 non-null    int64  
 5   human_work                608 non-null    int64  
 6   work_per_human            608 non-null    float64
 7   human_gini                608 non-null    float64
 8   human_Push                608 non-null    int64  
 9   human_IssueComments       608 non-null    int64  
 10  human_PRReviewComment     608 non-null    int64  
 11  human_MergedPR            608 non-null    int64  
 12  bot_work                  608 non-null    int64  
 13  bot_Push                  608 non-null    int64  
 14  bot_IssueC

The summary above lists the 19 columns in the teams dataframe, with the column names on the left and each column's data type on the right. Three columns hold text stored as 'object' and can be treated as categorical: 'name_h', 'Team_type', and 'Team_size_class'. Let's convert them from 'object' to 'category'.

In [9]:
# We need to go through the three different categorical columns and convert them to the 'category' data type.

teams['name_h'] = teams['name_h'].astype('category')
teams['Team_type'] = teams['Team_type'].astype('category')
teams['Team_size_class'] = teams['Team_size_class'].astype('category')

# Now let's confirm the changes to our data frame.

teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   name_h                    608 non-null    category
 1   Team_type                 608 non-null    category
 2   Team_size_class           608 non-null    category
 3   human_members_count       608 non-null    int64   
 4   bot_members_count         608 non-null    int64   
 5   human_work                608 non-null    int64   
 6   work_per_human            608 non-null    float64 
 7   human_gini                608 non-null    float64 
 8   human_Push                608 non-null    int64   
 9   human_IssueComments       608 non-null    int64   
 10  human_PRReviewComment     608 non-null    int64   
 11  human_MergedPR            608 non-null    int64   
 12  bot_work                  608 non-null    int64   
 13  bot_Push                  608 non-null    int64   
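The three separate conversions can also be written as a single call by passing a dictionary to `astype`. The following is a minimal sketch on a small made-up frame (the `demo` data is hypothetical, not taken from github_teams.csv):

```python
import pandas as pd

# Hypothetical stand-in for the lab data; the values are made up.
demo = pd.DataFrame({
    "name_h": ["a/1", "b/2", "c/3"],
    "Team_type": ["human", "human-bot", "human"],
    "Team_size_class": ["Small", "Large", "Medium"],
    "human_members_count": [2, 9, 4],
})

# Map each categorical column name to the target dtype and convert in one step.
categorical_columns = ["name_h", "Team_type", "Team_size_class"]
demo = demo.astype({name: "category" for name in categorical_columns})

# The three listed columns are now 'category'; the count column is untouched.
print(demo.dtypes)
```

Columns that are not listed in the dictionary keep their original data type.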

Sweet! Our three categorical variables have successfully been converted to the 'category' data type!  
Now let's go through and see how many unique values the 'Team_type' and 'Team_size_class' columns have!

In [83]:

#List the unique values in the 'Team_type' column.

teams.Team_type.unique()  

['human-bot', 'human']
Categories (2, object): ['human', 'human-bot']

We can see that there are two different unique values in the 'Team_type' variable. Now how many unique values does the 'Team_size_class' column contain?

In [85]:

#List the unique values in the 'Team_size_class' column.

teams.Team_size_class.unique()

['Small', 'Large', 'Medium']
Categories (3, object): ['Large', 'Medium', 'Small']
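`unique()` lists the distinct values, so the count has to be read off the output. If only the number is needed, `nunique()` returns it directly. A minimal sketch, assuming a small made-up frame (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical data with the same two categorical columns as the lab file.
demo = pd.DataFrame({
    "Team_type": ["human-bot", "human", "human", "human-bot"],
    "Team_size_class": ["Small", "Large", "Medium", "Small"],
}).astype("category")

# nunique() returns the number of distinct values as an integer.
n_team_types = demo["Team_type"].nunique()
n_size_classes = demo["Team_size_class"].nunique()

print(n_team_types, n_size_classes)
```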

Sweet! There are three unique values in the 'Team_size_class' column.  
Now that we have looked at the shape of the dataframe and the column names, changed data types, and counted unique values, let's play around with some indexing.  
The first value we need to find is in the 63rd row and 6th column:

In [115]:

#Select the value with the appropriate numerical column and row index.

teams.iloc[62, 5]

35

Now, let's print out the values in the 300th row:

In [123]:

#Select the row given the appropriate numerical row index.

teams.iloc[299:300]

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
299,IyfocAGfAHLncCVJUujqTA/A_QZ6HlUb5sRQHhPa7SGzQ,human-bot,Medium,4,1,1049,262.25,0.448761,739,213,91,6,52,0,52,0,0,27.0,243.0


### Note: Positions in pandas are zero-based, so the displayed index is one less than the ordinal row number. Since the rows start at index 0, the 300th row is displayed as row 299.
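The same zero-based rule applies whether a row is requested as a single position or as a slice; the difference is only in the type that comes back. A minimal sketch on a made-up ten-row frame (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with 10 rows labelled 0..9.
demo = pd.DataFrame({"value": range(10, 20)})

# The 3rd row lives at position 2.
third_row_as_series = demo.iloc[2]     # single position -> Series
third_row_as_frame = demo.iloc[2:3]    # one-element slice -> one-row DataFrame

print(type(third_row_as_series).__name__, type(third_row_as_frame).__name__)
```

A slice such as `iloc[299:300]` therefore keeps the tabular layout, while `iloc[299]` would return the same values as a Series.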

Great! Now that we have gone over the basics of indexing, let's look at different methods for selecting certain values in a data frame.  
First, we will use three different methods to select the row with index value 595 and its 1st, 2nd, and 3rd columns.

In [145]:

#Select using the raw numerical indexing, with the range for the first three columns.

teams.iloc[595, 0:3]

name_h             zAh1NECRCquqUJ_-1d6hAw/DET3jTK8hokYfY_neJ1IVQ
Team_type                                              human-bot
Team_size_class                                            Small
Name: 595, dtype: object

In [147]:

#Select using the name of the columns, along with the numerical row index.

teams.loc[595, 'name_h':'Team_size_class']

name_h             zAh1NECRCquqUJ_-1d6hAw/DET3jTK8hokYfY_neJ1IVQ
Team_type                                              human-bot
Team_size_class                                            Small
Name: 595, dtype: object

In [207]:

#Create a subset based on the data type (The first three were converted to 'category').

teams_category = teams.select_dtypes(include = ['category'])

#Select the appropriate row from the created 'category' subset.

teams_category.iloc[595,:]

name_h             zAh1NECRCquqUJ_-1d6hAw/DET3jTK8hokYfY_neJ1IVQ
Team_type                                              human-bot
Team_size_class                                            Small
Name: 595, dtype: object

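Great! Now we have selected specific values using three different methods!  
Let's continue by using two different methods to select the row with index value 46 and two of its columns. Note that both cells below use 'human_members_count' and 'human_gini', which sit at zero-based positions 3 and 7 and are therefore the 4th and 8th columns; in ordinal terms, the 3rd and 7th columns are 'Team_size_class' and 'work_per_human' (positions 2 and 6).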

In [227]:

#Select using the numerical row index along with the names of the columns.

teams.loc[46, ['human_members_count','human_gini']]

human_members_count           6
human_gini             0.506981
Name: 46, dtype: object

In [229]:

#Select with numerical indexing for both row and column(s).

teams.iloc[46, [3,7]]

human_members_count           6
human_gini             0.506981
Name: 46, dtype: object
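Because `iloc` counts from zero, an ordinal such as "3rd column" has to be reduced by one before it is used as a position. The sketch below shows that conversion on a one-row made-up frame that reuses the first eight column names from the lab file (the row values are hypothetical):

```python
import pandas as pd

# First eight column names from the lab file, in their original order.
column_names = [
    "name_h", "Team_type", "Team_size_class", "human_members_count",
    "bot_members_count", "human_work", "work_per_human", "human_gini",
]

# One made-up row, used only to show which columns get selected.
demo = pd.DataFrame([["a/1", "human", "Small", 2, 0, 62, 31.0, 0.47]],
                    columns=column_names)

# Ordinal column numbers (1-based) converted to iloc positions (0-based).
ordinals = [3, 7]
positions = [n - 1 for n in ordinals]

by_position = demo.iloc[0, positions]              # numeric selection
by_label = demo.loc[0, demo.columns[positions]]    # same columns by name

print(list(by_position.index))
```

`demo.columns.get_loc("human_gini")` goes the other way and reports the zero-based position of a column name.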

Awesome! Now that we know how to access the data, let's create a new dataframe from the 'bot_work' column using two different methods!

### Create New DataFrame

In [251]:

#Create a new dataframe by passing the column explicitly to the DataFrame constructor.

just_bot_work = pd.DataFrame(teams.bot_work)

#Print to confirm.

type(just_bot_work)

pandas.core.frame.DataFrame

In [249]:

#Create a new dataframe by filtering on the column name; filter() returns a DataFrame without an explicit constructor call.

only_bot_work = teams.filter(like = 'bot_work')

#Print to confirm.

type(only_bot_work)

pandas.core.frame.DataFrame
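Two other common ways to get a one-column DataFrame are double-bracket selection and `Series.to_frame()`. A minimal sketch on made-up data (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with two of the bot-related columns.
demo = pd.DataFrame({"bot_work": [43, 0, 1972], "bot_Push": [0, 0, 0]})

single_brackets = demo["bot_work"]            # one name -> Series
with_brackets = demo[["bot_work"]]            # list of names -> DataFrame
with_to_frame = demo["bot_work"].to_frame()   # Series converted to DataFrame

print(type(single_brackets).__name__, type(with_brackets).__name__)
```

Unlike `filter(like=...)`, which keeps every column whose name contains the given text, both of these select exactly the named column.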

### References

https://www.geeksforgeeks.org/python-subset-dataframe-by-column-name/#

# Sorting and Ordering Data

We will now go through some basic exercises for sorting and ordering the data.  

First we will begin by selecting 'human-bot' teams that have a 'bot_members_count' value greater than or equal to 2:

In [259]:
teams[(teams.Team_type == 'human-bot')&(teams.bot_members_count >=2)]

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
3,_l5u7I5p4thtW5SjR_9_4w/aZNCdVXta7fh7eCMzZP1CA,human-bot,Large,234,12,14579,62.303419,0.738342,1942,11430,1170,37,1972,0,1972,0,0,1.0,4757.0
4,_l5u7I5p4thtW5SjR_9_4w/m_FpD7PKQHqVXHn2bh7u2g,human-bot,Large,38,8,1625,42.763158,0.666607,203,1270,152,0,302,0,302,0,0,2.0,777.0
42,2-scMrZv13F95YPZmfieww/4Zc56iUYjIZrZU06omFrJw,human-bot,Large,23,2,4648,202.086957,0.560241,864,2574,1174,36,1325,0,1325,0,0,11.0,1635.0
84,4YoH8row044yJjPIqWJw9Q/NSXj3i61X71lV0StTN71Ww,human-bot,Small,2,2,114,57.0,0.491228,114,0,0,0,37,0,37,0,0,0.0,14.0
89,5Is-_ie16OEGmW1arZm8qg/8UeSk2P76pTG7pPLtxsHTQ,human-bot,Large,17,4,7412,436.0,0.439621,4182,1257,1917,56,358,5,202,151,0,2.0,495.0
110,7sA-8-nyqr0Ri2CT4-FSZw/GJPQoUhHfvUsxKcdkHWLEw,human-bot,Small,3,2,244,81.333333,0.502732,171,73,0,0,136,0,136,0,0,1.0,41.0
146,bi5TY2Z4OSQq3PMs6JnKYA/5wtZcUUo1XmLHIra8NDtFQ,human-bot,Medium,4,2,170,42.5,0.717647,144,7,19,0,104,0,104,0,0,,
147,bi5TY2Z4OSQq3PMs6JnKYA/9b9IqkDK14ketwn88f3hKA,human-bot,Small,3,2,189,63.0,0.624339,174,10,5,0,125,0,125,0,0,35.0,9.0
149,bi5TY2Z4OSQq3PMs6JnKYA/kIiAIJpk6lOa6Nxf234KkQ,human-bot,Small,3,2,88,29.333333,0.636364,74,7,7,0,74,0,74,0,0,,
224,FAhkB4rsocfDW0vrM8U8NA/3KHgTzOwWtAxTXlp_mbqoA,human-bot,Large,15,2,4821,321.4,0.689096,2564,1801,386,70,270,90,116,52,12,13.0,1522.0


Next, we will find 'human' teams that are 'Large' and have a 'human_gini' value greater than or equal to 0.75:

In [261]:
teams[(teams.Team_type == 'human') & (teams.Team_size_class == 'Large') & (teams.human_gini >= 0.75)]

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
138,ASYGR96YA91p3z7MNKjZCA/IB2pZ8ygcvNnlxUdysjSFA,human,Large,12,0,1655,137.916667,0.799446,793,684,178,0,0,0,0,0,0,4.0,190.0
285,IiUao8vA_zm_uEIVVLI-Sw/91ya8vlSP8qgwCllH_6BSw,human,Large,25,0,3599,143.96,0.863507,1249,2350,0,0,0,0,0,0,0,0.0,1245.0
505,uLHPO58cQefwrJUbyhYOKQ/7YWOP8uDEeKDHQMWKqOoYA,human,Large,48,0,5748,119.75,0.78204,1715,3891,142,0,0,0,0,0,0,0.0,1200.0
582,y8Jw59EHVSrsluSuhR5okg/V5vb074jNkzg4YCKforX1Q,human,Large,8,0,277,34.625,0.781137,275,2,0,0,0,0,0,0,0,,


How many teams are in the 'Small' or 'Large' category?

In [263]:
teams[(teams.Team_size_class != 'Medium')] # The column has only three unique values, so excluding 'Medium' leaves 'Small' and 'Large'.

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
0,_1bqaxzCk0sfQaunsjeViQ/RCEZ3CASdLXbstu9y2JQ7Q,human-bot,Small,2,1,66,33.000000,0.287879,29,33,4,0,43,0,43,0,0,87.0,8.0
1,_9o07rGiC7DFyi-zm91Q0g/VOgMsrjYEwFAq0BY8kHqGQ,human,Small,2,0,62,31.000000,0.467742,62,0,0,0,0,0,0,0,0,,
2,_DzK53uaZXnAX3WcC0W28g/Epc4QWw5PNBQIIdvopEHDA,human,Large,7,0,211,30.142857,0.499661,194,16,1,0,0,0,0,0,0,37.0,46.0
3,_l5u7I5p4thtW5SjR_9_4w/aZNCdVXta7fh7eCMzZP1CA,human-bot,Large,234,12,14579,62.303419,0.738342,1942,11430,1170,37,1972,0,1972,0,0,1.0,4757.0
4,_l5u7I5p4thtW5SjR_9_4w/m_FpD7PKQHqVXHn2bh7u2g,human-bot,Large,38,8,1625,42.763158,0.666607,203,1270,152,0,302,0,302,0,0,2.0,777.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
601,zPjKmela4b_-cQYzslpJLQ/Y8b87jxGXuKYhNDrcC5AuA,human,Large,11,0,851,77.363636,0.267279,740,111,0,0,0,0,0,0,0,3.0,22.0
602,zro7Xud3Xy2f5CjF55l_jA/GChw8QQ_KUPepXGZGDWicQ,human-bot,Small,2,1,39,19.500000,0.397436,39,0,0,0,15,0,15,0,0,,
603,zTj5tlMWgotzJmQl7BP8wQ/iQ914_smScbUO8BI9JlE6A,human-bot,Small,3,1,855,285.000000,0.474854,423,59,373,0,26,0,26,0,0,,
604,zUBexdmYylGGpxiebXm6gg/sJXD2kulWzU35ijdY3SnBQ,human,Small,2,0,63,31.500000,0.436508,63,0,0,0,0,0,0,0,0,,


As we can see above, there are 428 teams that are in the 'Small' or 'Large' category.  
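The count can also be computed rather than read from the displayed table. One option is to build the mask with `isin` and sum it, since each `True` counts as 1. A minimal sketch on made-up data (`demo` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data; only the two columns needed for the filters.
demo = pd.DataFrame({
    "Team_size_class": ["Small", "Large", "Medium", "Small", "Medium"],
    "human_gini": [0.10, 0.50, 0.15, 0.30, 0.60],
})

# isin() names the wanted categories explicitly instead of excluding 'Medium'.
small_or_large = demo["Team_size_class"].isin(["Small", "Large"])

# Summing a boolean mask counts the rows that satisfy it.
n_small_or_large = int(small_or_large.sum())
n_low_gini = int((small_or_large & (demo["human_gini"] <= 0.20)).sum())

print(n_small_or_large, n_low_gini)
```

Naming the categories explicitly keeps the filter correct even if a fourth size class were ever added to the data.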

Next, we will find how many teams are in the 'Small' or 'Large' category with a 'human_gini' value less than or equal to 0.20.

In [265]:
teams[(teams.Team_size_class != 'Medium')&(teams.human_gini <= 0.2)] 

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
25,0Jb-T8Py7O9d0fsJhO-Fpw/9CTgx1gdJFtNJTvxhIZZUA,human-bot,Small,2,1,8,4.000000,0.125000,6,1,1,0,13,13,0,0,0,,
26,0Jb-T8Py7O9d0fsJhO-Fpw/D-kWwOGGgx1lyqq0SaB3Hg,human-bot,Small,2,1,2,1.000000,0.000000,2,0,0,0,69,40,29,0,0,,
27,0Jb-T8Py7O9d0fsJhO-Fpw/Fiq28J6fBxWRHY488lzXjw,human-bot,Small,3,1,35,11.666667,0.133333,19,4,12,0,28,21,7,0,0,,
30,0Jb-T8Py7O9d0fsJhO-Fpw/OeEnv21rUaAGZUVGe1lW4w,human-bot,Small,2,1,25,12.500000,0.100000,20,1,4,0,42,21,21,0,0,,
32,0Jb-T8Py7O9d0fsJhO-Fpw/TjlzNzwYdgEidxHzg9liQw,human-bot,Small,2,1,93,46.500000,0.188172,21,4,68,0,44,22,22,0,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
569,x7QU8xqefX05ZXj5iogsWQ/H8AZrLiK64D6EviC6ycKwQ,human,Small,2,0,162,81.000000,0.000000,90,19,53,0,0,0,0,0,0,57.0,14.0
591,z4XnuVnYDsJq2Kh10c6RSQ/GwurdQHVdJIEVr1-nwzeQQ,human,Small,2,0,47,23.500000,0.074468,30,15,2,0,0,0,0,0,0,,
598,zdGIfGigrvjugcDMrntSeA/djHSoRCz0x72XAy7M0faVA,human,Small,2,0,86,43.000000,0.011628,32,54,0,0,0,0,0,0,0,,
599,zDxFw19WoS8GFVmQGJXQCw/xsTexXJ1UlUoG5F8Lns-7A,human-bot,Large,59,1,198,3.355932,0.189009,114,82,2,0,9,3,5,1,0,0.0,10.0


There are only 66 teams in either the 'Small' or 'Large' category with a 'human_gini' value less than or equal to 0.20.

How many 'human-bot' teams are in the 'Medium' category?

In [10]:
teams[(teams.Team_size_class == 'Medium')&(teams.Team_type == 'human-bot')] 

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
18,-VBsDowjPrv6Gc92IQmkCw/c0DwadNZWJRzNMqmUyFrlQ,human-bot,Medium,6,1,111,18.500000,0.656156,111,0,0,0,291,291,0,0,0,,
21,05CtW4OlNx8ZQEHqdg4D0A/qBJZzvjpWcYk3x1Fbl0ylw,human-bot,Medium,4,1,13,3.250000,0.288462,12,1,0,0,13,11,2,0,0,,
34,0tSFBIz2tzBkhwKgfSq7UQ/n51RvHDvIcpcFPMHmAop2A,human-bot,Medium,6,1,527,87.833333,0.625237,184,226,117,0,70,0,70,0,0,1.0,60.0
41,2_VbJzim2NH2AJd2zkXHoQ/eEXeqBpwx7x1M-1kMlhSuQ,human-bot,Medium,4,1,46,11.500000,0.152174,33,3,10,0,13,0,13,0,0,12.0,9.0
74,3z53uINAmVsEcPTcpD4Plw/gLii5wLAk2cRP-CoNl0m5A,human-bot,Medium,4,1,958,239.500000,0.366910,665,217,76,0,111,0,111,0,0,20.0,152.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
566,wZomVVWBfuxk9Q0JPmcxlg/r_ZBq4_RkdsqBN6mcK9OfA,human-bot,Medium,5,1,328,65.600000,0.415854,120,199,9,0,10,0,10,0,0,5.0,79.0
568,X1pcjhqqelcsRxAhcbNu-w/HJ1fkoQOuwxXJ25L7FGEcw,human-bot,Medium,5,2,281,56.200000,0.422776,95,94,85,7,51,0,51,0,0,64.0,25.0
580,y2tuHVVF3jQ1CFnzAuXvzA/KVC-dDWpsVdeVLFMNpQodg,human-bot,Medium,6,1,990,165.000000,0.680808,883,75,20,12,218,0,218,0,0,,
586,YQI5l7pwpiRU8KVKXkNjjg/fCxq0R6_z233Fdr2Ly6iug,human-bot,Medium,5,2,1644,328.800000,0.560341,214,997,433,0,684,251,433,0,0,4.0,45.0


We can see from the output above that there are 84 'human-bot' teams in the 'Medium' team size category.
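When counts are needed for every combination of team type and size class, `pd.crosstab` produces them in one table. A minimal sketch on made-up data (`demo` is hypothetical, so the counts differ from the lab file):

```python
import pandas as pd

# Hypothetical data with the two categorical columns.
demo = pd.DataFrame({
    "Team_type": ["human-bot", "human", "human-bot", "human", "human-bot"],
    "Team_size_class": ["Medium", "Medium", "Small", "Large", "Medium"],
})

# Rows are team types, columns are size classes, cells are row counts.
counts = pd.crosstab(demo["Team_type"], demo["Team_size_class"])

# Look up a single combination by its row and column labels.
n_human_bot_medium = counts.loc["human-bot", "Medium"]

print(counts)
```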


Next we will create a subsample of 50% of the data.

In [269]:
teams_subsample = teams.sample(frac = 0.5, replace = False)
teams_subsample

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
127,9yAG1Wl2EUQb-mgMbu14cg/0SE8Ru05hPPuZfPY-bmeFw,human,Medium,4,0,57,14.250000,0.521930,57,0,0,0,0,0,0,0,0,,
533,VjnX-IGsAsRz8EJKMp6LLA/qZSqyEmsJJ_Rr7f9toRbwQ,human-bot,Medium,6,1,559,93.166667,0.710495,534,17,2,6,46,0,46,0,0,36.0,40.0
585,YPFiQiVJuFMczRm6Imq07g/HOpZe8bhcIW-uBnBCUgQGA,human,Small,3,0,110,36.666667,0.418182,81,6,23,0,0,0,0,0,0,0.0,12.0
294,Iput0s8_LPvasQTIhOmp7g/p8HydCeDkoyeJMEuQM_H6w,human,Medium,5,0,84,16.800000,0.328571,84,0,0,0,0,0,0,0,0,,
393,orUqs7eqk5gLzHyHtZHnBQ/n5wY3lp1zD9WG_Pe9HlNCg,human-bot,Small,3,1,193,64.333333,0.556131,42,129,22,0,32,0,32,0,0,3.0,53.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
565,Wnm0-8FGezdR3V0XA1l8OQ/xoxVnZVtTKIPQ17XSm5x5g,human,Small,2,0,51,25.500000,0.166667,51,0,0,0,0,0,0,0,0,,
435,RAtEUtifcd1nLjGg-UKXBg/Sk3uzPW33N-7C0H9hEC6aQ,human,Small,3,0,70,23.333333,0.142857,70,0,0,0,0,0,0,0,0,,
33,0MSCypBPqH6wOKkMednZLg/M-Fsij2kAaViWDhjjMVjSw,human-bot,Large,7,1,875,125.000000,0.553469,802,63,3,7,23,0,23,0,0,5.0,30.0
406,pSEjWkAL9xy8WUmVHEtdOA/Vn1iiavAqm0rljqqdGBmPg,human,Small,2,0,116,58.000000,0.448276,116,0,0,0,0,0,0,0,0,,
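`sample` draws different rows on every run unless a seed is supplied. Passing `random_state` makes the subsample repeatable, which helps when the notebook is re-run for grading. A minimal sketch on a made-up frame (`demo` and the seed value 42 are arbitrary choices):

```python
import pandas as pd

# Hypothetical frame with 10 rows.
demo = pd.DataFrame({"team_id": range(10)})

# The same seed yields the same 50% subsample on every call.
first_draw = demo.sample(frac=0.5, replace=False, random_state=42)
second_draw = demo.sample(frac=0.5, replace=False, random_state=42)

print(len(first_draw), first_draw.equals(second_draw))
```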


Next, we will create samples for an 8-fold cross-validation test.

In [18]:
KF = KFold(n_splits = 8) # K-Fold cross-validator with 8 folds (no shuffling by default)

for train, test in KF.split(teams):
    print("%s %s" % (train, test))

[ 76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93
  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111
 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129
 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147
 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165
 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183
 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201
 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219
 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237
 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273
 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291
 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309
 310 311 312 313 314 315 316 317 318 319 320 321 32

In [19]:
# 'train' still holds the row positions from the last fold of the loop above.
teams.iloc[train]

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
0,_1bqaxzCk0sfQaunsjeViQ/RCEZ3CASdLXbstu9y2JQ7Q,human-bot,Small,2,1,66,33.000000,0.287879,29,33,4,0,43,0,43,0,0,87.0,8.0
1,_9o07rGiC7DFyi-zm91Q0g/VOgMsrjYEwFAq0BY8kHqGQ,human,Small,2,0,62,31.000000,0.467742,62,0,0,0,0,0,0,0,0,,
2,_DzK53uaZXnAX3WcC0W28g/Epc4QWw5PNBQIIdvopEHDA,human,Large,7,0,211,30.142857,0.499661,194,16,1,0,0,0,0,0,0,37.0,46.0
3,_l5u7I5p4thtW5SjR_9_4w/aZNCdVXta7fh7eCMzZP1CA,human-bot,Large,234,12,14579,62.303419,0.738342,1942,11430,1170,37,1972,0,1972,0,0,1.0,4757.0
4,_l5u7I5p4thtW5SjR_9_4w/m_FpD7PKQHqVXHn2bh7u2g,human-bot,Large,38,8,1625,42.763158,0.666607,203,1270,152,0,302,0,302,0,0,2.0,777.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
527,vD1mLoQ_CzlsnXcM1E_WOA/XM83BYKoM4QtXWStPYssrw,human-bot,Small,2,1,106,53.000000,0.433962,99,7,0,0,18,0,18,0,0,,
528,vD1mLoQ_CzlsnXcM1E_WOA/Yk03OqUO6JLk05uYj9RK6Q,human-bot,Small,2,1,50,25.000000,0.440000,48,2,0,0,10,0,10,0,0,,
529,VemkfETAeqVyhw0s77AlUw/ERsm7IP5zmTUg2sWF5b-cQ,human-bot,Large,14,1,1878,134.142857,0.753461,1409,465,4,0,597,0,597,0,0,4.0,186.0
530,VGYUfaNcvjujHdwS_xv9xA/dts2QrkRvgHdrxZCJYXG6w,human-bot,Small,2,1,90,45.000000,0.244444,36,32,22,0,12,0,12,0,0,1.0,9.0


In [20]:
# 'test' still holds the row positions from the last fold of the loop above.
teams.iloc[test]

Unnamed: 0,name_h,Team_type,Team_size_class,human_members_count,bot_members_count,human_work,work_per_human,human_gini,human_Push,human_IssueComments,human_PRReviewComment,human_MergedPR,bot_work,bot_Push,bot_IssueComments,bot_PRReviewComment,bot_MergedPR,eval_survival_day_median,issues_count
532,VjnX-IGsAsRz8EJKMp6LLA/HQ9CHS_vbj1CSd6OM0aCoA,human-bot,Large,12,1,3609,300.750000,0.703773,2964,559,76,10,34,0,24,0,10,30.0,518.0
533,VjnX-IGsAsRz8EJKMp6LLA/qZSqyEmsJJ_Rr7f9toRbwQ,human-bot,Medium,6,1,559,93.166667,0.710495,534,17,2,6,46,0,46,0,0,36.0,40.0
534,vjYCi8YxMpUj_LaqRdiCXw/wPCNTC9mtdWb8MKJVr439g,human,Large,28,0,10705,382.321429,0.743961,4230,2337,3997,141,0,0,0,0,0,4.0,21.0
535,vlLrA8LGOcUxkQuGbs4TqA/LbQfqlh-Ihko3_Yii02dhQ,human,Medium,5,0,657,131.400000,0.264231,483,134,40,0,0,0,0,0,0,3.0,74.0
536,vpAJthlySeoTSTCzS0iH9w/co9Uzr_rNRVxqxS0x1UpqA,human-bot,Medium,4,1,143,35.750000,0.456294,119,10,14,0,95,94,1,0,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603,zTj5tlMWgotzJmQl7BP8wQ/iQ914_smScbUO8BI9JlE6A,human-bot,Small,3,1,855,285.000000,0.474854,423,59,373,0,26,0,26,0,0,,
604,zUBexdmYylGGpxiebXm6gg/sJXD2kulWzU35ijdY3SnBQ,human,Small,2,0,63,31.500000,0.436508,63,0,0,0,0,0,0,0,0,,
605,zVSBi-iRKCzLiqFwVt6hbg/8SfUBIOeWUjDoQxeUCX7wQ,human,Medium,5,0,26,5.200000,0.446154,19,5,2,0,0,0,0,0,0,,
606,zVSBi-iRKCzLiqFwVt6hbg/fm_lDWwc8Uu-aZ24BjUNZg,human,Medium,5,0,13,2.600000,0.246154,8,4,1,0,0,0,0,0,0,8.0,10.0
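Without shuffling, `KFold` hands out consecutive blocks of rows as test sets, which is why the first printed training array starts at 76 and the last test set starts at 532. The sketch below reproduces those boundaries with NumPy alone; it imitates the unshuffled split for 608 rows and 8 folds and is not scikit-learn's own implementation:

```python
import numpy as np

# Row and fold counts taken from the notebook above.
n_rows, n_folds = 608, 8
all_positions = np.arange(n_rows)

# Split the positions into 8 consecutive blocks; each block is one test set.
test_folds = np.array_split(all_positions, n_folds)

for fold_number, test_positions in enumerate(test_folds):
    # Everything outside the current test block is the training set.
    train_positions = np.setdiff1d(all_positions, test_positions)
    print(fold_number, test_positions[0], test_positions[-1], len(train_positions))
```

If the rows are ordered in some meaningful way, passing `shuffle=True` (optionally with `random_state`) to `KFold` avoids folds made of neighbouring rows.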


Great, now that we've got the cross-validation test out of the way, let's select the columns that are numeric and save them as a new DataFrame! Let's take another quick look at the variables within the 'teams' dataframe:

In [21]:
teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   name_h                    608 non-null    category
 1   Team_type                 608 non-null    category
 2   Team_size_class           608 non-null    category
 3   human_members_count       608 non-null    int64   
 4   bot_members_count         608 non-null    int64   
 5   human_work                608 non-null    int64   
 6   work_per_human            608 non-null    float64 
 7   human_gini                608 non-null    float64 
 8   human_Push                608 non-null    int64   
 9   human_IssueComments       608 non-null    int64   
 10  human_PRReviewComment     608 non-null    int64   
 11  human_MergedPR            608 non-null    int64   
 12  bot_work                  608 non-null    int64   
 13  bot_Push                  608 non-null    int64   

We can see that there are two types of numeric data types in the dataframe: 'int64' and 'float64'. With this information in mind, let's now move forward with selecting these columns and moving them to a new dataframe!

In [67]:
numeric_only = teams.select_dtypes(include=['int64','float64'])
numeric_only.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   human_members_count       608 non-null    int64  
 1   bot_members_count         608 non-null    int64  
 2   human_work                608 non-null    int64  
 3   work_per_human            608 non-null    float64
 4   human_gini                608 non-null    float64
 5   human_Push                608 non-null    int64  
 6   human_IssueComments       608 non-null    int64  
 7   human_PRReviewComment     608 non-null    int64  
 8   human_MergedPR            608 non-null    int64  
 9   bot_work                  608 non-null    int64  
 10  bot_Push                  608 non-null    int64  
 11  bot_IssueComments         608 non-null    int64  
 12  bot_PRReviewComment       608 non-null    int64  
 13  bot_MergedPR              608 non-null    int64  
 14  eval_survi

Awesome! Now we have our numeric-only dataframe! One task remains: we want to remove the 'bot_PRReviewComment' and 'bot_MergedPR' columns from it. Let's drop these columns:

In [68]:
numeric_only = numeric_only.drop(['bot_PRReviewComment','bot_MergedPR'], axis = 1)

Now let's print the column names to ensure that they are gone!

In [69]:
numeric_only.columns

Index(['human_members_count', 'bot_members_count', 'human_work',
       'work_per_human', 'human_gini', 'human_Push', 'human_IssueComments',
       'human_PRReviewComment', 'human_MergedPR', 'bot_work', 'bot_Push',
       'bot_IssueComments', 'eval_survival_day_median', 'issues_count'],
      dtype='object')
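The two steps above can be written slightly more generally: `include='number'` selects every numeric dtype without listing them, and `drop(columns=...)` avoids the `axis` argument. A minimal sketch on made-up data (`demo` is hypothetical):

```python
import pandas as pd

# Hypothetical frame mixing one text column with numeric columns.
demo = pd.DataFrame({
    "Team_type": ["human", "human-bot"],
    "human_work": [62, 66],
    "human_gini": [0.47, 0.29],
    "bot_PRReviewComment": [0, 0],
    "bot_MergedPR": [0, 0],
})

# 'number' covers all integer and float dtypes.
numeric_demo = demo.select_dtypes(include="number")

# columns= names the axis explicitly; the original frame is not modified.
trimmed_demo = numeric_demo.drop(columns=["bot_PRReviewComment", "bot_MergedPR"])

print(list(trimmed_demo.columns))
```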

Now moving back over to the 'teams' dataframe. Let's now take 'Team_size_class' and 'human_members_count' and save them to their own dataframe:

In [82]:
team_size_and_members = teams.iloc[:,[2,3]]

Now that we have put these two columns into their own dataframe, let's now rename the columns to make it a little neater:

In [85]:
team_size_and_members.columns = ['Team Size', 'Number of Members']
team_size_and_members.columns

Index(['Team Size', 'Number of Members'], dtype='object')
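Assigning a list to `.columns` depends on the order of the columns. An alternative is `rename` with an old-to-new mapping, which returns a new DataFrame and leaves the source untouched. A minimal sketch on made-up data (`demo` is hypothetical; the new names match the ones used above):

```python
import pandas as pd

# Hypothetical frame with the two columns of interest.
demo = pd.DataFrame({
    "Team_size_class": ["Small", "Large"],
    "human_members_count": [2, 7],
})

# Select by name, then rename through an explicit old -> new mapping.
renamed = demo[["Team_size_class", "human_members_count"]].rename(
    columns={
        "Team_size_class": "Team Size",
        "human_members_count": "Number of Members",
    }
)

print(list(renamed.columns))
```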

Thank you so much for going through my instructor-led lab for week 7. I hope you enjoyed it :)