# Indexing in pandas

We are going to use the same simple `people` DataFrame as in the previous notebook, __DataFrame and Series__.

In [1]:
import pandas as pd

In [2]:
people = {
    "first": ['John', 'Mary', 'Joe'],
    "last": ['Doe', 'Jane', 'Rogan'],
    "email": ['johndoe@email.com', 'maryjane@email.com', 'joerogan@email.com']
}

In [3]:
df = pd.DataFrame(people)

In [4]:
df

,first,last,email
0,John,Doe,johndoe@email.com
1,Mary,Jane,maryjane@email.com
2,Joe,Rogan,joerogan@email.com


As we can see, the __df__ `DataFrame` contains an unlabeled column of integers at the beginning. These integers identify the rows and are the default row index in pandas. Because they are sequential they are unique, but pandas does not actually enforce index uniqueness.
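As a minimal sketch of that last point, the frame below (made-up data, not part of the `people` example) is built with a repeated index label. pandas accepts it, `Index.is_unique` reports the duplication, and a label lookup returns every matching row.

```python
import pandas as pd

# Hypothetical data: the label 'a' is deliberately used twice in the index.
dup_df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "a", "b"])

# pandas builds the frame without complaint; is_unique reports the duplicate.
print(dup_df.index.is_unique)

# A lookup on a duplicated label returns all matching rows as a DataFrame.
matches = dup_df.loc["a"]
print(matches)

# The default integer index, by contrast, is unique.
default_df = pd.DataFrame({"value": [10, 20, 30]})
print(default_df.index.is_unique)
```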
### Set one of the DataFrame's columns as the row index
We can replace the default integer row index with one of the columns by calling    
`DataFrame.set_index(keys)`, where `keys` is the name of the column to use as the index.

In [5]:
df.set_index('email')

email,first,last
johndoe@email.com,John,Doe
maryjane@email.com,Mary,Jane
joerogan@email.com,Joe,Rogan


`df.set_index('email')` does not modify the DataFrame in place. It returns a new DataFrame with the new index, so the output above is only a preview, and checking __df__ shows that it still has the default integer index.    
To make the change permanent, set the parameter `inplace` to `True`.

In [6]:
df

,first,last,email
0,John,Doe,johndoe@email.com
1,Mary,Jane,maryjane@email.com
2,Joe,Rogan,joerogan@email.com


In [7]:
df.set_index('email', inplace = True)

In [8]:
df

email,first,last
johndoe@email.com,John,Doe
maryjane@email.com,Mary,Jane
joerogan@email.com,Joe,Rogan
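An alternative to `inplace=True` that is often seen in pandas code is to assign the DataFrame returned by `set_index` to a name. The sketch below rebuilds the `people` frame so that it can run on its own; `indexed_df` is a name introduced here purely for illustration.

```python
import pandas as pd

people = {
    "first": ["John", "Mary", "Joe"],
    "last": ["Doe", "Jane", "Rogan"],
    "email": ["johndoe@email.com", "maryjane@email.com", "joerogan@email.com"],
}
df = pd.DataFrame(people)

# set_index returns a new DataFrame; df itself is left untouched.
indexed_df = df.set_index("email")

print(list(df.columns))          # ['first', 'last', 'email']
print(indexed_df.index.name)     # email
print(list(indexed_df.columns))  # ['first', 'last']
```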


In [11]:
df.index

Index(['johndoe@email.com', 'maryjane@email.com', 'joerogan@email.com'], dtype='object', name='email')

Now we can use `df.loc` to access rows by their labels.

In [12]:
df.loc['johndoe@email.com']

first    John
last      Doe
Name: johndoe@email.com, dtype: object

### Reset index to default

After one of the `DataFrame`'s columns has been set as the index, `df.loc` no longer accepts integer positions, because it looks rows up by label and the labels are now email addresses.    
To select rows by integer position, use `df.iloc`.    
To discard a custom row index and return to the default one, use `DataFrame.reset_index(inplace=True)`.
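The difference between label-based and position-based lookup can be sketched as follows. The `people` frame is rebuilt here so the example is self-contained, and `first_row` and `restored` are illustrative names that do not appear elsewhere in this notebook.

```python
import pandas as pd

people = {
    "first": ["John", "Mary", "Joe"],
    "last": ["Doe", "Jane", "Rogan"],
    "email": ["johndoe@email.com", "maryjane@email.com", "joerogan@email.com"],
}
df = pd.DataFrame(people).set_index("email")

# .iloc always selects by position, whatever the index labels are.
first_row = df.iloc[0]
print(first_row["first"])   # John

# .loc selects by label, so the integer 0 is not found among the email labels.
try:
    df.loc[0]
except KeyError as err:
    print("KeyError:", err)

# reset_index moves 'email' back into a regular column and restores 0, 1, 2.
restored = df.reset_index()
print(restored.loc[0, "email"])   # johndoe@email.com
```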

In [13]:
df.reset_index()

,email,first,last
0,johndoe@email.com,John,Doe
1,maryjane@email.com,Mary,Jane
2,joerogan@email.com,Joe,Rogan


In [14]:
df

email,first,last
johndoe@email.com,John,Doe
maryjane@email.com,Mary,Jane
joerogan@email.com,Joe,Rogan


In [15]:
df.reset_index(inplace = True)

In [16]:
df

,email,first,last
0,johndoe@email.com,John,Doe
1,maryjane@email.com,Mary,Jane
2,joerogan@email.com,Joe,Rogan


Now it is time to apply what we have just learned to the Stack Overflow survey data.    
If we want a custom index, it is recommended to set it while loading the data instead of calling `DataFrame.set_index` afterwards.    
To do so, set the `index_col` parameter of `pd.read_csv` to the column you want to use as the index.
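A small sketch of the two routes, using an in-memory CSV in place of the real survey file. The rows in `csv_text` are made up for illustration; only the column names `Respondent`, `Hobbyist` and `Country` come from the survey data.

```python
import io
import pandas as pd

# A tiny stand-in for the survey file (made-up rows, not the real data).
csv_text = """Respondent,Hobbyist,Country
1,Yes,United Kingdom
2,No,Thailand
3,Yes,Ukraine
"""

# index_col sets the index while the file is being read.
at_load = pd.read_csv(io.StringIO(csv_text), index_col="Respondent")

# The two-step equivalent: load first, then call set_index.
two_step = pd.read_csv(io.StringIO(csv_text)).set_index("Respondent")

print(at_load.equals(two_step))    # True

# The labels start at 1, so .loc[1] and .iloc[0] refer to the same row here.
print(at_load.loc[1, "Country"])   # United Kingdom
print(at_load.iloc[0]["Country"])  # United Kingdom
```

Because the `Respondent` labels are integers, `survey_df.loc[1]` looks up the label `1`, not the second row by position.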

In [18]:
survey_df = pd.read_csv('data/survey_results_public.csv', index_col = 'Respondent')
schema_df = pd.read_csv('data/survey_results_schema.csv')

In [19]:
survey_df

Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,EduOther,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,"Taught yourself a new language, framework, or ...",...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",Taken an online course in programming or softw...,...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88377,,Yes,Less than once a month but more than once per ...,The quality of OSS and closed source software ...,"Not employed, and not looking for work",Canada,No,Primary/elementary school,,"Taught yourself a new language, framework, or ...",...,,Tech articles written by other developers;Tech...,,Man,No,,,No,Appropriate in length,Easy
88601,,No,Never,The quality of OSS and closed source software ...,,,,,,,...,,,,,,,,,,
88802,,No,Never,,Employed full-time,,,,,,...,,,,,,,,,,
88816,,No,Never,"OSS is, on average, of HIGHER quality than pro...","Independent contractor, freelancer, or self-em...",,,,,,...,,,,,,,,,,


In [21]:
survey_df.loc[1]

MainBranch                 I am a student who is learning to code
Hobbyist                                                      Yes
OpenSourcer                                                 Never
OpenSource      The quality of OSS and closed source software ...
Employment                 Not employed, and not looking for work
                                      ...                        
Sexuality                                 Straight / Heterosexual
Ethnicity                                                     NaN
Dependents                                                     No
SurveyLength                                Appropriate in length
SurveyEase                             Neither easy nor difficult
Name: 1, Length: 84, dtype: object

A real-world use case for a custom index: to find out what a __column name__ in `survey_df` means, we have to consult the `schema_df` `DataFrame` to see which question that column corresponds to. Instead of searching manually, we can index `schema_df` by __column name__ and then use the column name to look up the corresponding question directly.

In [22]:
schema_df

,Column,QuestionText
0,Respondent,Randomized respondent ID number (not in order ...
1,MainBranch,Which of the following options best describes ...
2,Hobbyist,Do you code as a hobby?
3,OpenSourcer,How often do you contribute to open source?
4,OpenSource,How do you feel about the quality of open sour...
...,...,...
80,Sexuality,Which of the following do you currently identi...
81,Ethnicity,Which of the following do you identify as? Ple...
82,Dependents,"Do you have any dependents (e.g., children, el..."
83,SurveyLength,How do you feel about the length of the survey...


In [24]:
schema_df.set_index('Column', inplace= True)

In [25]:
schema_df

Column,QuestionText
Respondent,Randomized respondent ID number (not in order ...
MainBranch,Which of the following options best describes ...
Hobbyist,Do you code as a hobby?
OpenSourcer,How often do you contribute to open source?
OpenSource,How do you feel about the quality of open sour...
...,...
Sexuality,Which of the following do you currently identi...
Ethnicity,Which of the following do you identify as? Ple...
Dependents,"Do you have any dependents (e.g., children, el..."
SurveyLength,How do you feel about the length of the survey...


In [31]:
schema_df.loc['OpenSource']

QuestionText    How do you feel about the quality of open sour...
Name: OpenSource, dtype: object

As the result above shows, long questions come out truncated. Although there are display options that can be changed to show the full text, the easiest way is to access the target cell by row and column together.

In [32]:
schema_df.loc['OpenSource', 'QuestionText']

'How do you feel about the quality of open source software (OSS)?'
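For completeness, here is a hedged sketch of the display-option route mentioned above, next to the scalar lookup. It assumes a pandas version in which `display.max_colwidth` accepts `None` (1.0 or later); the one-row `schema` frame is a stand-in built for this example, not the real `schema_df`.

```python
import pandas as pd

# A one-row stand-in schema with a question long enough to be truncated.
long_question = "How do you feel about the quality of open source software (OSS)?"
schema = pd.DataFrame(
    {"QuestionText": [long_question]},
    index=pd.Index(["OpenSource"], name="Column"),
)

# Default display: long strings are cut off with '...'.
print(schema.loc["OpenSource"])

# Inside the context manager the column width limit is lifted temporarily.
with pd.option_context("display.max_colwidth", None):
    full_repr = repr(schema.loc["OpenSource"])
print(full_repr)

# Scalar access returns the plain Python string, which is never truncated.
print(schema.at["OpenSource", "QuestionText"])
```

`pd.option_context` restores the previous setting when the block ends, which avoids changing the display behavior of the rest of the notebook.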

### Index Sorting option

Another option that eases manual searching through `schema_df` is to sort the index alphabetically with the `DataFrame.sort_index()` method.

In [35]:
schema_df.sort_index()

Column,QuestionText
Age,What is your age (in years)? If you prefer not...
Age1stCode,At what age did you write your first line of c...
BetterLife,Do you think people born today will have a bet...
BlockchainIs,Blockchain / cryptocurrency technology is prim...
BlockchainOrg,How is your organization thinking about or imp...
...,...
WorkPlan,How structured or planned is your work?
WorkRemote,How often do you work remotely?
WorkWeekHrs,"On average, how many hours per week do you work?"
YearsCode,"Including any education, how many years have y..."


In [36]:
schema_df

Column,QuestionText
Respondent,Randomized respondent ID number (not in order ...
MainBranch,Which of the following options best describes ...
Hobbyist,Do you code as a hobby?
OpenSourcer,How often do you contribute to open source?
OpenSource,How do you feel about the quality of open sour...
...,...
Sexuality,Which of the following do you currently identi...
Ethnicity,Which of the following do you identify as? Ple...
Dependents,"Do you have any dependents (e.g., children, el..."
SurveyLength,How do you feel about the length of the survey...


From the previous two cells, it is clear that `DataFrame.sort_index()` also returns a new DataFrame and needs `inplace=True` to change `schema_df` itself. In addition, we can set the `ascending` parameter to `False` to sort in descending order.

In [37]:
schema_df.sort_index(ascending= False)

Column,QuestionText
YearsCodePro,How many years have you coded professionally (...
YearsCode,"Including any education, how many years have y..."
WorkWeekHrs,"On average, how many hours per week do you work?"
WorkRemote,How often do you work remotely?
WorkPlan,How structured or planned is your work?
...,...
BlockchainOrg,How is your organization thinking about or imp...
BlockchainIs,Blockchain / cryptocurrency technology is prim...
BetterLife,Do you think people born today will have a bet...
Age1stCode,At what age did you write your first line of c...


In [38]:
schema_df.sort_index(inplace=True)

In [39]:
schema_df

Column,QuestionText
Age,What is your age (in years)? If you prefer not...
Age1stCode,At what age did you write your first line of c...
BetterLife,Do you think people born today will have a bet...
BlockchainIs,Blockchain / cryptocurrency technology is prim...
BlockchainOrg,How is your organization thinking about or imp...
...,...
WorkPlan,How structured or planned is your work?
WorkRemote,How often do you work remotely?
WorkWeekHrs,"On average, how many hours per week do you work?"
YearsCode,"Including any education, how many years have y..."
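The sorting behavior shown in this section can be reproduced on a small frame. The three rows below are made up for illustration (shortened question texts, real column names), and `ascending`, `descending` and `result` are names introduced only for this sketch.

```python
import pandas as pd

# Made-up schema rows in arbitrary order.
schema = pd.DataFrame(
    {"QuestionText": ["Hours per week", "Age in years", "Code as a hobby?"]},
    index=pd.Index(["WorkWeekHrs", "Age", "Hobbyist"], name="Column"),
)

# sort_index returns a sorted copy; the original order is unchanged.
ascending = schema.sort_index()
descending = schema.sort_index(ascending=False)

print(list(ascending.index))    # ['Age', 'Hobbyist', 'WorkWeekHrs']
print(list(descending.index))   # ['WorkWeekHrs', 'Hobbyist', 'Age']
print(list(schema.index))       # ['WorkWeekHrs', 'Age', 'Hobbyist']

# With inplace=True the frame itself is reordered and the call returns None.
result = schema.sort_index(inplace=True)
print(result)                   # None
print(list(schema.index))       # ['Age', 'Hobbyist', 'WorkWeekHrs']
```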
