# Instructor-led Lab: Manipulating Data


  
# Instructions:

In this assignment you will practice utilizing DataFrames in a program. 

For this lab, you will use the [github_teams.csv](/data/github_team.csv) file, which contains behavioral trace data extracted from GitHub.

## Accessing Data

Using the `github_teams` dataset, please perform the following operations in order:

* Open the file within Python.
* Find out what the column header names are.
* Determine the number of columns.
* Determine the number of rows.
* Determine which columns are categorical and convert them from *object* to *category*.
* How many unique values does `Team_type` have?
* How many unique values does `Team_size_class` have?
* What is the value of the 63rd row and 6th column?
* What are the values for the 300th row?
* Using three different methods, select row with index value 595 with 1st, 2nd, 3rd columns.
* Using two different methods, select the row with index value 46 with the 3rd and 7th columns.
* Create a new DataFrame for the column `bot_work` using two different methods.

## Sorting and Ordering data 

Now that you have learned to subsample data, it is your turn to apply your new knowledge. Using the `github_teams` dataset, please perform the following operations in order:

* Select `human-bot` teams that have a `bot_members_count` value greater than or equal to 2.
* Find the `human` teams that are `Large` and have a `human_gini` value greater than or equal to 0.75.
* How many teams are in the `Small` or `Large` category?
* How many teams are in the `Small` or `Large` category with a `human_gini` value less than or equal to 0.20?
* How many `human-bot` teams are in the `Medium` category?
* Create a subsample of 50% of your data.
* Create samples for an 8-fold cross-validation test.
* Select the columns that are numeric and save them as a new DataFrame.
* Remove the columns `bot_PRReviewComment` and `bot_MergedPR` from the DataFrame.
* Save the columns `Team_size_class` and `human_members_count` as a new DataFrame.
* Rename these two columns in the new DataFrame.

Save your notebook with output displayed within it and submit for grading.

### Import Modules

In [109]:
#Import the various modules into Jupyter Notebook so that they can be used later
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

Great! We have the various modules imported into Jupyter Notebook! Now we need to read in the .csv file that contains the necessary data!  

### Accessing Data:

In [5]:
#First, let's go through and find our current working directory, and ensure that we are using the appropriate working directory!
os.getcwd()

'C:\\Users\\spencer.simpson\\Documents\\python_class\\week_7'

Awesome! We can see that the current working directory is the appropriate one with all of the necessary files!  
Now, let's read in the data!  

There is only one .csv file that needs to be read into Jupyter Notebook, that being the github_teams.csv file. This file can be found in the "\data\" directory within the "\week_7\" directory. 

In [15]:
teams = pd.read_csv('data\\github_teams.csv')
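
The path above uses Windows-style separators. As a hedged sketch of a portable alternative, the cell below builds the path with `pathlib`. The temporary folder and the two-row file are made-up stand-ins for the lab's `data` directory, not the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the lab layout: <base>/data/github_teams.csv
base_dir = Path(tempfile.mkdtemp())
(base_dir / "data").mkdir()
csv_path = base_dir / "data" / "github_teams.csv"
csv_path.write_text("name_h,Team_type\nteam_a,human\nteam_b,human-bot\n")

# Path objects are joined with "/" and resolve correctly on any operating system.
demo = pd.read_csv(csv_path)
print(demo.shape)
```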

Awesomesauce! We now have the .csv file in Jupyter Notebook! Let's begin our work by taking a quick look at the file and finding out what the header names are using the '.columns' attribute!  

### Inspect Columns

In [231]:
for column_name in teams.columns:
    print(column_name)

name_h
Team_type
Team_size_class
human_members_count
bot_members_count
human_work
work_per_human
human_gini
human_Push
human_IssueComments
human_PRReviewComment
human_MergedPR
bot_work
bot_Push
bot_IssueComments
bot_PRReviewComment
bot_MergedPR
eval_survival_day_median
issues_count


Wow! That is a lot of columns. Let's count how many there are!

In [49]:
print(f"Number of Columns: {teams.shape[1]}") # .shape returns a (rows, columns) tuple for a dataframe; index [1] holds the number of columns.

Number of Columns: 19


Now we know that there are 19 different columns within the teams dataframe. Why don't we now take a look at the height of the dataframe to figure out how many rows there are?

In [51]:
print(f"Number of Rows: {teams.shape[0]}") # .shape returns a (rows, columns) tuple for a dataframe; index [0] holds the number of rows.

Number of Rows: 608


There are 608 rows within the teams dataframe.  
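
As a small aside, both numbers can be read in one step because `.shape` is a tuple. The sketch below uses a made-up three-row frame rather than the lab data.

```python
import pandas as pd

# Toy data standing in for the lab's dataframe (values are made up).
toy = pd.DataFrame({
    "Team_type": ["human", "human-bot", "human"],
    "human_work": [10, 25, 7],
})

# .shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy.shape
print(f"Rows: {n_rows}, Columns: {n_cols}")

# len() on the dataframe and on its columns gives the same two numbers.
same_rows = len(toy) == n_rows
same_cols = len(toy.columns) == n_cols
```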

Now that we have an idea of the size of the teams dataframe, why don't we take a closer look at the columns and what type they are? 

In [63]:
teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   name_h                    608 non-null    object 
 1   Team_type                 608 non-null    object 
 2   Team_size_class           608 non-null    object 
 3   human_members_count       608 non-null    int64  
 4   bot_members_count         608 non-null    int64  
 5   human_work                608 non-null    int64  
 6   work_per_human            608 non-null    float64
 7   human_gini                608 non-null    float64
 8   human_Push                608 non-null    int64  
 9   human_IssueComments       608 non-null    int64  
 10  human_PRReviewComment     608 non-null    int64  
 11  human_MergedPR            608 non-null    int64  
 12  bot_work                  608 non-null    int64  
 13  bot_Push                  608 non-null    int64  
 14  bot_IssueC

Above we can see the 19 different columns in the teams dataframe. On the left-hand side we can see the names of the columns, and on the right we can see the data type of each column. Three columns have the 'object' data type: 'name_h', 'Team_type', and 'Team_size_class'. We will treat all three as categorical and change their data type from 'object' to 'category'.

In [73]:
# We need to go through the three different categorical columns and convert them to the 'category' data type

teams['name_h'] = teams['name_h'].astype('category')
teams['Team_type'] = teams['Team_type'].astype('category')
teams['Team_size_class'] = teams['Team_size_class'].astype('category')

# Now let's confirm the changes to our data frame

teams.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   name_h                    608 non-null    category
 1   Team_type                 608 non-null    category
 2   Team_size_class           608 non-null    category
 3   human_members_count       608 non-null    int64   
 4   bot_members_count         608 non-null    int64   
 5   human_work                608 non-null    int64   
 6   work_per_human            608 non-null    float64 
 7   human_gini                608 non-null    float64 
 8   human_Push                608 non-null    int64   
 9   human_IssueComments       608 non-null    int64   
 10  human_PRReviewComment     608 non-null    int64   
 11  human_MergedPR            608 non-null    int64   
 12  bot_work                  608 non-null    int64   
 13  bot_Push                  608 non-null    int64   
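
As a side note, the three separate conversions could also be written as a single call. The sketch below passes a dictionary to `astype` on a made-up frame; the column names are borrowed from the lab, the values are not.

```python
import pandas as pd

# Toy frame with two text columns and one numeric column (values are made up).
toy = pd.DataFrame({
    "Team_type": ["human", "human-bot", "human"],
    "Team_size_class": ["Small", "Large", "Small"],
    "human_work": [10, 25, 7],
})

# Map each column that should become categorical to the 'category' dtype.
categorical_cols = ["Team_type", "Team_size_class"]
toy = toy.astype({col: "category" for col in categorical_cols})

# Columns left out of the mapping keep their original dtype.
print(toy.dtypes)
```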

Sweet! Our three categorical variables have successfully been converted to the 'category' data type!  
Now let's go through and see how many unique values the 'Team_type' and 'Team_size_class' columns have!

In [83]:
teams.Team_type.unique()  

['human-bot', 'human']
Categories (2, object): ['human', 'human-bot']

We can see that there are two different unique values in the 'Team_type' variable. Now how many unique values does the 'Team_size_class' column contain?

In [85]:
teams.Team_size_class.unique()

['Small', 'Large', 'Medium']
Categories (3, object): ['Large', 'Medium', 'Small']
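
`unique()` lists the values, so the count has to be read off the output. As a sketch on made-up data, `nunique()` returns the count directly and `value_counts()` shows how many rows fall under each value.

```python
import pandas as pd

# Toy data using the same labels as the lab columns (rows are made up).
toy = pd.DataFrame({
    "Team_type": ["human-bot", "human", "human-bot", "human"],
    "Team_size_class": ["Small", "Large", "Medium", "Small"],
})

# nunique() counts distinct values instead of listing them.
n_types = toy["Team_type"].nunique()
n_sizes = toy["Team_size_class"].nunique()
print(n_types, n_sizes)

# value_counts() gives the number of rows per distinct value.
size_counts = toy["Team_size_class"].value_counts()
print(size_counts)
```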

Sweet! There are three different unique values in the 'Team_size_class' column.  
Now that we have looked at the shape of the dataframe and the column names, changed data types, and counted unique values, let's play around with some indexing.  
The first value we need to find is in the 63rd row and 6th column:

In [115]:
teams.iloc[62, 5]

35

Now, let's print out the values that are in the 300th row:

In [123]:
teams.iloc[299:300]

| | name_h | Team_type | Team_size_class | human_members_count | bot_members_count | human_work | work_per_human | human_gini | human_Push | human_IssueComments | human_PRReviewComment | human_MergedPR | bot_work | bot_Push | bot_IssueComments | bot_PRReviewComment | bot_MergedPR | eval_survival_day_median | issues_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 299 | IyfocAGfAHLncCVJUujqTA/A_QZ6HlUb5sRQHhPa7SGzQ | human-bot | Medium | 4 | 1 | 1049 | 262.25 | 0.448761 | 739 | 213 | 91 | 6 | 52 | 0 | 52 | 0 | 0 | 27.0 | 243.0 |


### Note: Positions in pandas are zero-based, so a row's position is one less than its ordinal number. This can be seen with the 300th row: since the rows start at index 0, the 300th row is displayed as row 299.
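
To make the offset concrete, here is a minimal sketch on a made-up four-row frame. It also shows that position (`iloc`) and label (`loc`) only agree while the default index is intact.

```python
import pandas as pd

# Toy frame with the default index labels 0, 1, 2, 3 (values are made up).
toy = pd.DataFrame({"value": [10, 20, 30, 40]})

# The 3rd row sits at position 2; with the default index its label is also 2.
third_by_position = toy.iloc[2, 0]
third_by_label = toy.loc[2, "value"]

# After filtering, labels are kept but positions restart at 0.
filtered = toy[toy["value"] > 10]
first_filtered_value = filtered.iloc[0, 0]  # first remaining row by position
first_filtered_label = filtered.index[0]    # its original label
print(third_by_position, third_by_label, first_filtered_value, first_filtered_label)
```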

Great! Now that we have gone over the basics of indexing, let's look at different methods for selecting certain values in a data frame.  
For the first section, we will use three different methods to select the row with index value 595 together with the 1st, 2nd, and 3rd columns.

In [145]:
teams.iloc[595, 0:3]

name_h             zAh1NECRCquqUJ_-1d6hAw/DET3jTK8hokYfY_neJ1IVQ
Team_type                                              human-bot
Team_size_class                                            Small
Name: 595, dtype: object

In [147]:
teams.loc[595, 'name_h':'Team_size_class']

name_h             zAh1NECRCquqUJ_-1d6hAw/DET3jTK8hokYfY_neJ1IVQ
Team_type                                              human-bot
Team_size_class                                            Small
Name: 595, dtype: object

In [207]:
# This works here because the first three columns are exactly the three 'category' columns
teams_category = teams.select_dtypes(include = ['category'])
teams_category.iloc[595,:]

name_h             zAh1NECRCquqUJ_-1d6hAw/DET3jTK8hokYfY_neJ1IVQ
Team_type                                              human-bot
Team_size_class                                            Small
Name: 595, dtype: object

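Great! Now we have selected specific values using three different methods!  
Let's now continue by using two different methods to select the row with index value 46 with the 3rd and 7th columns. Note that the two cells below use the zero-based positions 3 and 7, which correspond to 'human_members_count' and 'human_gini' (the 4th and 8th columns when counting from one). Counting from one, the 3rd and 7th columns are 'Team_size_class' and 'work_per_human', at positions 2 and 6.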

In [227]:
teams.loc[46, ['human_members_count','human_gini']]

human_members_count           6
human_gini             0.506981
Name: 46, dtype: object

In [229]:
teams.iloc[46, [3,7]]

human_members_count           6
human_gini             0.506981
Name: 46, dtype: object
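
Because positions are zero-based, it can help to check which column names sit at given positions before selecting by name. The sketch below uses the first eight column names from the lab with one made-up row of values.

```python
import pandas as pd

# First eight column names from the lab; the single row of values is made up.
columns = ["name_h", "Team_type", "Team_size_class", "human_members_count",
           "bot_members_count", "human_work", "work_per_human", "human_gini"]
toy = pd.DataFrame([["t0", "human", "Small", 3, 0, 12, 4.0, 0.1]], columns=columns)

# "3rd and 7th columns" counted from one map to zero-based positions 2 and 6.
ordinals = [3, 7]
positions = [n - 1 for n in ordinals]
names = list(toy.columns[positions])
print(names)

# The reverse lookup: get_loc returns the zero-based position of a column name.
gini_position = toy.columns.get_loc("human_gini")

# Either form selects the same cells from the first row.
by_position = toy.iloc[0, positions]
by_name = toy.loc[0, names]
```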

Awesome! Now that we know how to access the data, let's go and create a dataframe using two different methods!

### Create New DataFrame

In [251]:
just_bot_work = pd.DataFrame(teams.bot_work)
type(just_bot_work)

pandas.core.frame.DataFrame

In [249]:
only_bot_work = teams.filter(like = 'bot_work')
type(only_bot_work)

pandas.core.frame.DataFrame
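
For comparison, two other common ways to get a one-column DataFrame are double brackets and `Series.to_frame()`. The sketch below uses made-up values; it also shows that `filter(like=...)` matches substrings, so a shorter pattern can pull in more columns than intended.

```python
import pandas as pd

# Toy frame with two bot-related columns (values are made up).
toy = pd.DataFrame({"bot_work": [0, 52, 3], "bot_Push": [0, 0, 1]})

# Double brackets select a list of columns and return a DataFrame.
via_brackets = toy[["bot_work"]]

# Single brackets return a Series, which to_frame() turns into a DataFrame.
as_series = toy["bot_work"]
via_to_frame = as_series.to_frame()

# filter(like=...) keeps every column whose name contains the substring.
n_matching_bot = toy.filter(like="bot_").shape[1]
print(type(via_brackets).__name__, type(as_series).__name__, n_matching_bot)
```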

### References

https://www.geeksforgeeks.org/python-subset-dataframe-by-column-name/#