#**Module 3: Data Wrangling Part 1**
In this module, you will learn how to
* Create dataframes with data subsets
* Reduce the number of columns and rows in a dataset
* Explain why reducing data is so important

**Be sure to expand all the hidden cells, run all the code, and do all the exercises--you will need the techniques for the lesson lab!**

##**The Goal**
Big datasets are, well, big. We aren't talking about thousands of rows and 10 or so attributes (aka features, dimensions, or variables); we are talking about billions of rows with many thousands of dimensions. Think all-purchases-on-Amazon.com-the-Saturday-before-Christmas big.

If you try to process that much data on your computer, guess what happens?

<img src="https://t4.ftcdn.net/jpg/05/59/39/65/360_F_559396565_1OMlX6HmsVNNuTLIBKXFQibqycB2vXQg.jpg">

I will neither confirm nor deny that I may be speaking from experience. Ahem.

So, in terms of big datasets, **BIGGER** does **NOT** mean **BETTER**. Instead, you'll want to think critically about what data from the big pool you need for your analysis and then work with only that subset.


#**0. Preparation and Setup**
We are working with our adult dataset again, so we're loading our libraries and our dataset just like last time, only with the url variable, which simplifies any dataset import for later.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

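# Storing the address in a url variable makes it easy to swap in another dataset later
url = "https://raw.githubusercontent.com/shstreuber/Data-Mining/master/data/adult.data.simplified25.csv"
# Reading in the data as the adult dataframe
adult = pd.read_csv(url)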

#Verifying that we can see the data
adult.head()

Unnamed: 0,age,workclass,education,educationyears,maritalstatus,occupation,relationship,race,sex,hoursperweek,nativecountry,incomeUSD
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,43747
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,13,United-States,38907
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,United-States,25055
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,United-States,26733
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,23429


#**1. Subsetting and Aggregating Data**


So, the idea is to reduce the dataset with which you are working to the smallest possible size and include ONLY the attributes AND the rows
that are useful and meaningful. This requires editing your dataset:

1. When you reduce the number of attributes or the number of rows, this is called **SUBSETTING**
2. When you summarize rows based on a common attribute (like, "all married people" or "all people under 17"), that is called **AGGREGATION**

Both together are called **DIMENSIONALITY REDUCTION**

The goal is to think critically (and do some math) about what the best attributes and rows are to include BEFORE you start processing your data. In other words: **PRE-PROCESSING**.
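
To make the two terms concrete, here is a minimal sketch on a tiny made-up dataframe (the `people` frame and its values are invented for illustration and are not part of the adult dataset). Subsetting keeps fewer rows or columns; aggregation collapses rows that share a value into one summary row.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only
people = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "maritalstatus": ["Never-married", "Divorced", "Never-married", "Divorced"],
    "hoursperweek": [40, 50, 20, 30],
})

# SUBSETTING: keep only some columns and some rows
subset = people[["age", "hoursperweek"]].iloc[0:2]

# AGGREGATION: summarize all rows that share a marital status into one row
aggregated = people.groupby("maritalstatus")["hoursperweek"].mean()

print(subset)
print(aggregated)
```

The subset still contains individual people, only fewer of them; the aggregate contains one row per marital status and no individuals at all.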

Let's get started!

##**1.1 Slicing and Subsetting**
Slicing and subsetting are related. While Slicing requires indexing (i.e. using the absolute row and column numbers, starting with 0), subsetting does not require indexing.

##**Slicing Magic: The iloc Indexer**

**iloc** allows you to define exactly the "fields" that you want to see:

* `df.iloc[0:5,]` shows you the ROWS with the indices 0 through 4 (the first 5)
* `df.iloc[:,0:5]` shows you the first five COLUMNS
* `df.iloc[0:5,0:5]` shows you the first five ROWS and the first five COLUMNS

Practice this below (remember to click on the "start" button to execute the code):

In [2]:
# Looking at only the first 5 rows
adult.iloc[0:5,]

Unnamed: 0,age,workclass,education,educationyears,maritalstatus,occupation,relationship,race,sex,hoursperweek,nativecountry,incomeUSD
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,43747
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,13,United-States,38907
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,United-States,25055
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,United-States,26733
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,23429


Now, use the field below to display only the first five columns

In [3]:
adult.iloc[:,0:5]

Unnamed: 0,age,workclass,education,educationyears,maritalstatus
0,39,State-gov,Bachelors,13,Never-married
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse
2,38,Private,HS-grad,9,Divorced
3,53,Private,11th,7,Married-civ-spouse
4,28,Private,Bachelors,13,Married-civ-spouse
...,...,...,...,...,...
24995,41,Private,10th,6,Married-civ-spouse
24996,19,Private,HS-grad,9,Never-married
24997,33,Private,HS-grad,9,Divorced
24998,21,?,Some-college,10,Never-married


And now the first five columns and the first five rows:

In [4]:
adult.iloc[0:5,0:5]

Unnamed: 0,age,workclass,education,educationyears,maritalstatus
0,39,State-gov,Bachelors,13,Never-married
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse
2,38,Private,HS-grad,9,Divorced
3,53,Private,11th,7,Married-civ-spouse
4,28,Private,Bachelors,13,Married-civ-spouse


##Your Turn##
Now use the code field below to save the first five columns and the first five rows into their own dataframe like this:

`adult_[yourname]_short = adult.iloc[` ... followed by the actual iloc code and then display the contents:

##**1.2 Subsetting**
Subsetting does not require knowing the index numbers for rows or columns. Instead, it sets up row-based filters. For more about subsetting, click [here](https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/) or continue below.
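
A row filter works in two steps, which the one-line examples in this section combine. The sketch here separates them on a small made-up dataframe (`toy` and its values are assumptions for illustration): the comparison first produces a True/False value for every row, and passing that mask back into the dataframe keeps only the rows marked True.

```python
import pandas as pd

# Hypothetical toy data, not the adult dataset
toy = pd.DataFrame({"age": [90, 35, 90, 17],
                    "sex": ["Male", "Female", "Female", "Male"]})

# Step 1: the comparison returns one boolean per row
mask = toy["age"] == 90
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows marked True
ninety = toy[mask]
print(ninety)
```

Notice that the filtered rows keep their original index labels, which is why filtered output often shows gaps in the row numbers.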

What if we want to see only the people who are 90 years old?

In [5]:
adult[adult['age'] == 90] # NOTE: == is the comparison operator ("is equal to"); a single = would be an assignment!

Unnamed: 0,age,workclass,education,educationyears,maritalstatus,occupation,relationship,race,sex,hoursperweek,nativecountry,incomeUSD
222,90,Private,HS-grad,9,Never-married,Other-service,Not-in-family,Black,Male,40,United-States,36040
1040,90,Private,HS-grad,9,Never-married,Other-service,Not-in-family,White,Female,40,United-States,24955
1935,90,Private,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,45,United-States,43154
2303,90,Private,Some-college,10,Never-married,Other-service,Not-in-family,Asian-Pac-Islander,Male,35,United-States,29655
2891,90,Private,Some-college,10,Separated,Adm-clerical,Own-child,White,Female,40,Puerto-Rico,20750
4070,90,Private,11th,7,Never-married,Handlers-cleaners,Own-child,White,Male,40,United-States,47745
4109,90,?,Bachelors,13,Widowed,?,Other-relative,White,Female,10,United-States,28674
5104,90,Private,Some-college,10,Never-married,Other-service,Not-in-family,Asian-Pac-Islander,Male,35,United-States,24933
5272,90,Private,9th,5,Never-married,Adm-clerical,Not-in-family,White,Female,40,United-States,40440
5370,90,Local-gov,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,60,United-States,119802


Now use <= 20 to find all the people who are 20 or younger:

In [8]:
adult[adult['age'] <= 20].shape

(1849, 12)

How about all the people who are older than 75?

In [10]:
adult[adult['age'] > 75].shape

(207, 12)

If you are subsetting with strings, the strings need to be in single or double quotes. Below, we are looking for everyone with a Bachelors degree:

In [11]:
adult[adult['education'] == 'Bachelors']

Unnamed: 0,age,workclass,education,educationyears,maritalstatus,occupation,relationship,race,sex,hoursperweek,nativecountry,incomeUSD
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,43747
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,13,United-States,38907
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,23429
9,42,Private,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,40,United-States,87200
11,30,State-gov,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,40,India,189843
...,...,...,...,...,...,...,...,...,...,...,...,...
24978,48,Private,Bachelors,13,Never-married,Exec-managerial,Not-in-family,White,Male,50,United-States,42320
24981,43,Self-emp-inc,Bachelors,13,Married-civ-spouse,Sales,Other-relative,Asian-Pac-Islander,Male,45,India,36018
24983,40,Local-gov,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,75,United-States,199107
24986,33,Private,Bachelors,13,Never-married,Sales,Not-in-family,White,Female,40,United-States,36303


##Your Turn
Find all the people in the United-States. Then, find all the people from Cuba!

##**1.3 Building your own reduced dataset**
Now you have all the tools you need to reduce your dataset (without changing its values--we'll do that in our next step) to a size that's

1. Manageable for your hardware
2. Practical for you to work with

One of the WORST things you can do to yourself, your computer, and your instructor is to keep working with the HUGE dataset. That often leads to confusion and doesn't work well. Use **discernment and critical thinking** when working with data. Your future employer will thank you!
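
One way to see what a reduction buys you is to compare memory footprints before and after. The following is a rough sketch on synthetic data (the `big` dataframe, its size, and its random values are made up for illustration); `memory_usage(deep=True)` is the pandas method that reports bytes per column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a large dataset (values are random, not real)
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "age": rng.integers(17, 91, size=100_000),
    "hoursperweek": rng.integers(1, 100, size=100_000),
    "incomeUSD": rng.integers(10_000, 200_000, size=100_000),
})

# Keep only the rows for people older than 75, and only the age column
small = big[big["age"] > 75][["age"]]

full_bytes = big.memory_usage(deep=True).sum()
small_bytes = small.memory_usage(deep=True).sum()
print(f"full: {full_bytes:,} bytes, reduced: {small_bytes:,} bytes")
```

The exact numbers depend on your data and pandas version, but the reduced dataframe should come out considerably smaller than the full one.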

Below are some of the techniques that will help you:

In [12]:
# Selecting multiple columns at the same time extracts a new DataFrame from your existing DataFrame.
# For selection of multiple columns, the syntax is: square-brace selection with a list of column names,
# e.g. data[['column_name_1', 'column_name_2']]
adult[['age','education','sex']]

Unnamed: 0,age,education,sex
0,39,Bachelors,Male
1,50,Bachelors,Male
2,38,HS-grad,Male
3,53,11th,Male
4,28,Bachelors,Female
...,...,...,...
24995,41,10th,Male
24996,19,HS-grad,Male
24997,33,HS-grad,Female
24998,21,Some-college,Male


In [13]:
reduced_data = adult[['age','education','sex']]

In [14]:
reduced_data

Unnamed: 0,age,education,sex
0,39,Bachelors,Male
1,50,Bachelors,Male
2,38,HS-grad,Male
3,53,11th,Male
4,28,Bachelors,Female
...,...,...,...
24995,41,10th,Male
24996,19,HS-grad,Male
24997,33,HS-grad,Female
24998,21,Some-college,Male


In [15]:
# Alternately, you can use numeric indexing with the iloc selector and a list of column numbers, e.g. data.iloc[:, [0,1,20,22]]
# This allows you to specify rows, as well
adult.iloc[0:5,[0,1,7,10,11]]

Unnamed: 0,age,workclass,race,nativecountry,incomeUSD
0,39,State-gov,White,United-States,43747
1,50,Self-emp-not-inc,White,United-States,38907
2,38,Private,White,United-States,25055
3,53,Private,Black,United-States,26733
4,28,Private,Black,Cuba,23429


**AAAAAAND**, as they say, the pièce de résistance: Combining row filters and column filters!

In [16]:
adult[adult['education'] == 'Bachelors'].iloc[0:5,[0,1,2,10,11]]

Unnamed: 0,age,workclass,education,nativecountry,incomeUSD
0,39,State-gov,Bachelors,United-States,43747
1,50,Self-emp-not-inc,Bachelors,United-States,38907
4,28,Private,Bachelors,Cuba,23429
9,42,Private,Bachelors,United-States,87200
11,30,State-gov,Bachelors,India,189843
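
Picking columns by position is easy to get wrong once columns have been added or dropped. As an alternative sketch, `loc` accepts a row filter and a list of column names in one step. The toy dataframe below is invented for illustration; only `loc` itself is standard pandas.

```python
import pandas as pd

# Hypothetical toy data, not the adult dataset
toy = pd.DataFrame({
    "age": [39, 50, 38, 28],
    "education": ["Bachelors", "Bachelors", "HS-grad", "Bachelors"],
    "nativecountry": ["United-States", "United-States", "United-States", "Cuba"],
})

# Rows: everyone with a Bachelors degree; columns: selected by name
bachelors = toy.loc[toy["education"] == "Bachelors", ["age", "nativecountry"]]
print(bachelors)

# The original row labels are kept, just like in the iloc example
print(bachelors.index.tolist())
```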


##Your Turn
Now, complete the code below to build an adult_small dataframe with rows 70-80 and all columns. You will need this dataframe later. (REMEMBER that the index numbers start with 0!!!!!!)

## **1.4. Aggregation and Dimensionality Reduction**
So far, we have learned the mechanics of making our datasets smaller based on practical deliberations--specifically, what columns and what rows we want in order to produce a valid analysis. The goal of Dimensionality Reduction is similar: To make our dataset smaller, so it's easier to handle. However, the reasons are different.

With Dimensionality Reduction, we are looking at **the data themselves** to show us ways in which they can be summarized and simplified. This can happen as follows:
- Column-based: Eliminate attributes that are basically duplicates of one another
- Row-based: Aggregate similar attribute levels in one level
- Binning and Bucketing
- Normalization of values

We will learn about the first two bullet points below; we will come back to the last two bullet points once we have stepped through data transformation.

###Column-Based Dimensionality Reduction###
In the adult dataset, 'maritalstatus' and 'relationship' look very closely related. If there is a 1:1 relationship between all data values (or at least most of them), the two columns carry duplicate information, so we can choose to eliminate one of them.

Let's see if we need both of them.

In [17]:
household=adult[['relationship', 'maritalstatus']]
household.groupby('relationship').sum()

Unnamed: 0_level_0,maritalstatus
relationship,Unnamed: 1_level_1
Husband,Married-civ-spouseMarried-civ-spouseMarried-ci...
Not-in-family,Never-marriedDivorcedMarried-spouse-absentNeve...
Other-relative,Married-civ-spouseNever-marriedNever-marriedNe...
Own-child,Never-marriedNever-marriedNever-marriedNever-m...
Unmarried,Never-marriedDivorcedSeparatedDivorcedNever-ma...
Wife,Married-civ-spouseMarried-civ-spouseMarried-AF...


In [19]:
household.head(10)

Unnamed: 0,relationship,maritalstatus
0,Not-in-family,Never-married
1,Husband,Married-civ-spouse
2,Not-in-family,Divorced
3,Husband,Married-civ-spouse
4,Wife,Married-civ-spouse
5,Wife,Married-civ-spouse
6,Not-in-family,Married-spouse-absent
7,Husband,Married-civ-spouse
8,Not-in-family,Never-married
9,Husband,Married-civ-spouse


The output shows us that many different values in 'maritalstatus' are tied to one value in 'relationship'. In relational database terms (for those of you who have taken a database class), this is basically a many-to-one relationship. What we are looking for is a one-to-one relationship. So, looked at from this side, 'maritalstatus' and 'relationship' won't work.

What if we look at this relationship the other way around, using 'maritalstatus' on the left, though?

In [18]:
household.groupby('maritalstatus').sum()

Unnamed: 0_level_0,relationship
maritalstatus,Unnamed: 1_level_1
Divorced,Not-in-familyUnmarriedUnmarriedNot-in-familyOw...
Married-AF-spouse,WifeWifeWifeHusbandHusbandHusbandWifeOwn-child...
Married-civ-spouse,HusbandHusbandWifeWifeHusbandHusbandHusbandHus...
Married-spouse-absent,Not-in-familyNot-in-familyUnmarriedNot-in-fami...
Never-married,Not-in-familyNot-in-familyOwn-childNot-in-fami...
Separated,UnmarriedUnmarriedOwn-childUnmarriedOther-rela...
Widowed,UnmarriedUnmarriedNot-in-familyNot-in-familyUn...


Let's try setting up a matrix that shows all unique combinations of 'relationship' and 'maritalstatus'. We will use relationship as index and use apply and lambda to collect the unique maritalstatus values that occur for each entry in that index.

In [20]:
household2 = household.groupby('relationship').apply(lambda x: x['maritalstatus'].unique())
household2

relationship
Husband                     [Married-civ-spouse, Married-AF-spouse]
Not-in-family     [Never-married, Divorced, Married-spouse-absen...
Other-relative    [Married-civ-spouse, Never-married, Separated,...
Own-child         [Never-married, Divorced, Married-civ-spouse, ...
Unmarried         [Never-married, Divorced, Separated, Widowed, ...
Wife                        [Married-civ-spouse, Married-AF-spouse]
dtype: object
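
If `apply` with a `lambda` feels opaque, the same grouping can be spelled out by selecting the column first and calling `unique()` on each group directly. This is a sketch on a made-up dataframe (`pairs` is hypothetical); it should collect the same lists as the apply/lambda spelling, with the column name attached to the result.

```python
import pandas as pd

# Hypothetical toy data, not the adult dataset
pairs = pd.DataFrame({
    "relationship": ["Husband", "Wife", "Husband", "Own-child"],
    "maritalstatus": ["Married-civ-spouse", "Married-civ-spouse",
                      "Married-AF-spouse", "Never-married"],
})

# Group by relationship, pick the maritalstatus column,
# and collect the distinct values found in each group
via_column = pairs.groupby("relationship")["maritalstatus"].unique()
print(via_column)
```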

Can you turn this around and use 'maritalstatus' as index? Use the field below.

In [21]:
household.groupby('maritalstatus').apply(lambda x: x['relationship'].unique())

maritalstatus
Divorced                 [Not-in-family, Unmarried, Own-child, Other-re...
Married-AF-spouse               [Wife, Husband, Own-child, Other-relative]
Married-civ-spouse       [Husband, Wife, Own-child, Other-relative, Not...
Married-spouse-absent    [Not-in-family, Unmarried, Own-child, Other-re...
Never-married            [Not-in-family, Own-child, Unmarried, Other-re...
Separated                [Unmarried, Own-child, Other-relative, Not-in-...
Widowed                  [Unmarried, Not-in-family, Own-child, Other-re...
dtype: object

So, we've looked at the connection between 'relationship' and 'maritalstatus' from all different sides--and we are still finding these one-to-many relationships that go both ways. Unless we join each unique value from one attribute with each unique value in the other attribute into the same column, we will need to keep both columns.

Let's see if there is a better connection between 'educationyears' and 'education.'

In [22]:
degree=adult[['educationyears', 'education']]
# degree.sort_values('educationyears')  # This gives us the entire list sorted, but we want to display the unique values
# degree.groupby('educationyears').sum() # That's what we had before--we can do better!

# Let's try setting a matrix that is indexed by educationyears. This is what apply and lambda x do.
degree2 = degree.groupby('educationyears').apply(lambda x: x['education'].unique())
degree2

educationyears
1        [Preschool]
2          [1st-4th]
3          [5th-6th]
4          [7th-8th]
5              [9th]
6             [10th]
7             [11th]
8             [12th]
9          [HS-grad]
10    [Some-college]
11       [Assoc-voc]
12      [Assoc-acdm]
13       [Bachelors]
14         [Masters]
15     [Prof-school]
16       [Doctorate]
dtype: object

##Your Turn

Can you turn this around and use 'education' as index?

In [23]:
degree.groupby('education').apply(lambda x: x['educationyears'].unique())

education
10th             [6]
11th             [7]
12th             [8]
1st-4th          [2]
5th-6th          [3]
7th-8th          [4]
9th              [5]
Assoc-acdm      [12]
Assoc-voc       [11]
Bachelors       [13]
Doctorate       [16]
HS-grad          [9]
Masters         [14]
Preschool        [1]
Prof-school     [15]
Some-college    [10]
dtype: object
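
Listings like the two above can also be checked with a count instead of by eye. The sketch below, on made-up data, uses `nunique()` to count how many distinct partner values each value has; two columns are one-to-one only if that count is 1 in both directions. The helper name `is_one_to_one` is hypothetical, not a pandas function.

```python
import pandas as pd

def is_one_to_one(df, left, right):
    """True if every value in `left` pairs with exactly one value in `right`, and vice versa."""
    left_ok = df.groupby(left)[right].nunique().max() == 1
    right_ok = df.groupby(right)[left].nunique().max() == 1
    return bool(left_ok and right_ok)

# Hypothetical toy data, not the adult dataset
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters"],
    "educationyears": [13, 9, 13, 14],
    "relationship": ["Husband", "Wife", "Not-in-family", "Husband"],
})

print(is_one_to_one(toy, "education", "educationyears"))
print(is_one_to_one(toy, "education", "relationship"))
```

One caveat: `nunique()` ignores missing values by default, so a column with gaps may look cleaner than it is.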

In contrast to 'maritalstatus' and 'relationship', it seems that 'educationyears' and 'education' are uniquely related. This means we need only one of these columns. Since working with numbers is usually easier, we choose 'educationyears' and will eliminate 'education'.
To drop a column, we can use a couple of methods:
1. We can rebuild the dataframe (or a different dataframe) with only the columns that we want, for example: `adult4=adult[['age','race','sex','educationyears','incomeUSD']]`--that kind of thing
2. We can use the pandas drop function as explained here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html. Another explanation with a data frame sample is here: http://cmdlinetips.com/2018/04/how-to-drop-one-or-more-columns-in-pandas-dataframe/.  This is a better way to modify your dataframe.
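
Both options can be sketched side by side on a small made-up dataframe (`toy` is hypothetical). Note that `drop` returns a new dataframe and leaves the original untouched unless you assign the result back; `columns=[...]` is an equivalent spelling of `axis=1`.

```python
import pandas as pd

# Hypothetical toy data, not the adult dataset
toy = pd.DataFrame({
    "age": [39, 50],
    "education": ["Bachelors", "HS-grad"],
    "educationyears": [13, 9],
})

# Option 1: rebuild the dataframe from the columns you want to keep
kept = toy[["age", "educationyears"]]

# Option 2: drop the column you do not want
dropped = toy.drop(columns=["education"])  # same as toy.drop(["education"], axis=1)

print(kept.equals(dropped))
print(list(toy.columns))  # the original still has all three columns
```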

Now, let's build an adult4 dataframe that contains all columns of the adult dataframe EXCEPT 'education'

In [24]:
adult4 = adult.drop(['education'], axis = 1)
adult4

Unnamed: 0,age,workclass,educationyears,maritalstatus,occupation,relationship,race,sex,hoursperweek,nativecountry,incomeUSD
0,39,State-gov,13,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,43747
1,50,Self-emp-not-inc,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,13,United-States,38907
2,38,Private,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,United-States,25055
3,53,Private,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,United-States,26733
4,28,Private,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,23429
...,...,...,...,...,...,...,...,...,...,...,...
24995,41,Private,6,Married-civ-spouse,Transport-moving,Husband,White,Male,60,United-States,26739
24996,19,Private,9,Never-married,Farming-fishing,Own-child,White,Male,40,United-States,43783
24997,33,Private,9,Divorced,Craft-repair,Own-child,White,Female,42,United-States,22932
24998,21,?,10,Never-married,?,Unmarried,White,Male,40,United-States,34094


### Row-Based Dimensionality Reduction
Along with setting a filter and storing the output in a separate dataframe, as we have seen at the beginning of this file, you can also remove rows from a dataframe by using the "drop" function. Rows are axis=0, which is also the default for drop().

drop() removes rows based on "labels" rather than numeric position. To delete rows based on their numeric position, use iloc to select the rows you want to keep and reassign the result to the dataframe. The examples below show the label-based route.

Read more about drop() and axis values (0 or 1) [here](https://www.shanelynn.ie/pandas-drop-delete-dataframe-rows-columns/).
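
The difference between labels and positions only shows up once the index is no longer 0, 1, 2, and so on. The sketch below uses a made-up dataframe with non-default index labels (all names and values are invented) to contrast `drop` by label with removal by position.

```python
import pandas as pd

# Hypothetical toy data whose index labels differ from the row positions
toy = pd.DataFrame({"age": [38, 53, 28, 49]}, index=[10, 20, 30, 40])

# By LABEL: removes the row whose index label is 20
by_label = toy.drop(20, axis=0)

# By POSITION: look up the labels at positions 0 and 1, then drop those
by_position = toy.drop(toy.index[[0, 1]], axis=0)

# Also by position: keep everything from position 2 onward with iloc
by_iloc = toy.iloc[2:]

print(by_label.index.tolist())
print(by_position.index.tolist())
```

In the adult dataframe the default labels happen to match the positions, so the two views coincide until rows are filtered out or the index is changed.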

In [25]:
# Delete the rows with label 'White'
# For label-based deletion, set the index first on the dataframe.
# set_index returns a new dataframe, so adult itself stays unchanged.
adult5 = adult.set_index('race')
adult5.head()

Unnamed: 0_level_0,age,workclass,education,educationyears,maritalstatus,occupation,relationship,sex,hoursperweek,nativecountry,incomeUSD
race,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
White,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,Male,40,United-States,43747
White,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,Male,13,United-States,38907
White,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,Male,40,United-States,25055
Black,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Male,40,United-States,26733
Black,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Female,40,Cuba,23429


In [26]:
# Now we delete the rows where the index shows "White"
adult5 = adult5.drop('White', axis=0) # Delete all rows with label 'White'
adult5.head()

Unnamed: 0_level_0,age,workclass,education,educationyears,maritalstatus,occupation,relationship,sex,hoursperweek,nativecountry,incomeUSD
race,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Black,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Male,40,United-States,26733
Black,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Female,40,Cuba,23429
Black,49,Private,9th,5,Married-spouse-absent,Other-service,Not-in-family,Female,16,Jamaica,45531
Black,37,Private,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Male,80,United-States,167514
Asian-Pac-Islander,30,State-gov,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Male,40,India,189843
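
Setting the index just to delete rows also changes the layout of the dataframe, because the 'race' column becomes the index. A row filter with `!=` reaches the same rows while leaving the columns as they were. This is a sketch on invented data, offered as an alternative rather than as the required approach.

```python
import pandas as pd

# Hypothetical toy data, not the adult dataset
toy = pd.DataFrame({
    "race": ["White", "Black", "White", "Asian-Pac-Islander"],
    "age": [39, 53, 38, 30],
})

# Route shown in this module: move 'race' into the index, then drop by label
via_drop = toy.set_index("race").drop("White", axis=0)

# Alternative: keep only the rows where race is not 'White'
via_filter = toy[toy["race"] != "White"]

print(via_drop["age"].tolist())
print(via_filter["age"].tolist())
```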


In [27]:
# We can also delete the rows with labels 0,1,5
adult5 = adult.drop([0,1,5], axis=0)
adult5.head()

Unnamed: 0,age,workclass,education,educationyears,maritalstatus,occupation,relationship,race,sex,hoursperweek,nativecountry,incomeUSD
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,United-States,25055
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,United-States,26733
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,23429
6,49,Private,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,16,Jamaica,45531
7,52,Self-emp-not-inc,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,45,United-States,103612


Check out the results above--notice how indices 0, 1, and 5 are missing?

##Your Turn

Now, put everything together that you have learned so far, experiment a bit, and then use the space below to build a new adult6 dataframe that contains only rows of male individuals.

In [28]:
adult.iloc[3,3]

7

In [29]:
adult.head()

Unnamed: 0,age,workclass,education,educationyears,maritalstatus,occupation,relationship,race,sex,hoursperweek,nativecountry,incomeUSD
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,43747
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,13,United-States,38907
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,United-States,25055
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,United-States,26733
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,23429


In [31]:
adult.iloc[0:5,2]



0    Bachelors
1    Bachelors
2      HS-grad
3         11th
4    Bachelors
Name: education, dtype: object