# DataFrame and Series

In the notebook __pandas_demo__ we got a first look at a DataFrame and a basic grasp of it as a two-dimensional representation of tabular data. Let's explore how a DataFrame represents data in terms of pure Python.   
   
Imagine we have a dictionary that holds information about a person:

In [1]:
person1 = {
    "first": 'John',
    "last": 'Doe',
    "email": 'johndoe@email.com'
}

We can think of the `person1` dictionary as a one-row table whose columns are `first`, `last`, and `email`. But what if we have other dictionaries that we want to represent in the same table, like:

In [2]:
person2 = {
    "first": 'Mary',
    "last": 'Jane',
    "email": 'maryjane@email.com'
}

person3 = {
    "first": 'Joe',
    "last": 'Rogan',
    "email": 'joerogan@email.com'
}

In this case we can merge the three dictionaries into a single dictionary with the same keys: `first`, `last`, and `email`. Each value is now a list: a list of first names, a list of last names, and a list of emails.  

In [3]:
people = {
    "first": ['John', 'Mary', 'Joe'],
    "last": ['Doe', 'Jane', 'Rogan'],
    "email": ['johndoe@email.com', 'maryjane@email.com', 'joerogan@email.com']
}
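The merge described above can also be done programmatically instead of by hand. The following is a minimal sketch, assuming every dictionary has exactly the same keys; the name `records` is introduced here only for illustration and is not part of the notebook.

```python
person1 = {"first": "John", "last": "Doe", "email": "johndoe@email.com"}
person2 = {"first": "Mary", "last": "Jane", "email": "maryjane@email.com"}
person3 = {"first": "Joe", "last": "Rogan", "email": "joerogan@email.com"}

# `records` is a hypothetical helper name used only in this sketch.
records = [person1, person2, person3]

# For each key, collect that key's value from every record, in order.
people = {key: [record[key] for record in records] for key in person1}

print(people["first"])  # ['John', 'Mary', 'Joe']
```

The comprehension takes its keys from `person1`, so a dictionary with a missing key would raise a `KeyError` rather than silently produce lists of different lengths.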

The structure of the `people` dictionary can now be used to create a DataFrame in pandas: a tabular representation of the data, with one column per key and one row per position in the lists.  
   
   
First, we import pandas and give it the alias `pd`.

In [4]:
import pandas as pd

Use the constructor `pd.DataFrame()` to create a DataFrame:

In [5]:
df = pd.DataFrame(people)

In [6]:
df

Unnamed: 0,first,last,email
0,John,Doe,johndoe@email.com
1,Mary,Jane,maryjane@email.com
2,Joe,Rogan,joerogan@email.com


As we can see, the DataFrame __df__ has one column for each key of the `people` dictionary, plus an unlabeled column at the beginning that holds the index.
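As a side note, `pd.DataFrame()` also accepts a list of row dictionaries, so the original per-person dictionaries could be passed without merging them first. The sketch below assumes the same three people; the names `records` and `df_from_rows` are illustrative only.

```python
import pandas as pd

# `records` and `df_from_rows` are hypothetical names used only here.
records = [
    {"first": "John", "last": "Doe", "email": "johndoe@email.com"},
    {"first": "Mary", "last": "Jane", "email": "maryjane@email.com"},
    {"first": "Joe", "last": "Rogan", "email": "joerogan@email.com"},
]
people = {
    "first": ["John", "Mary", "Joe"],
    "last": ["Doe", "Jane", "Rogan"],
    "email": ["johndoe@email.com", "maryjane@email.com", "joerogan@email.com"],
}

df = pd.DataFrame(people)             # dictionary of columns
df_from_rows = pd.DataFrame(records)  # list of rows

print(df_from_rows.equals(df))  # True
print(df.shape)                 # (3, 3): three rows, three columns
print(list(df.index))           # [0, 1, 2], the default integer index
```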

We can access a column of __df__ with either of two notations.   
- bracket notation:   
    `df['email']`   
- dot notation:   
    `df.email`

In [7]:
df['email']

0     johndoe@email.com
1    maryjane@email.com
2    joerogan@email.com
Name: email, dtype: object

In [8]:
df.email

0     johndoe@email.com
1    maryjane@email.com
2    joerogan@email.com
Name: email, dtype: object
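The two notations are not always interchangeable. Dot notation only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method. The sketch below uses a made-up DataFrame called `scores` to show both cases.

```python
import pandas as pd

# `scores` is a made-up DataFrame used only to illustrate the caveat.
scores = pd.DataFrame({"first name": ["John", "Mary"], "count": [3, 5]})

# A column name containing a space can only be reached with brackets.
print(scores["first name"].tolist())  # ['John', 'Mary']

# Bracket notation returns the column named "count" ...
print(scores["count"].tolist())  # [3, 5]

# ... while dot notation finds the DataFrame.count method instead.
print(callable(scores.count))  # True
```

For this reason bracket notation is the safer default when column names come from external data.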

Each column of __df__ is a `Series`. We can think of __df__ as a group of `Series` objects combined in one data structure.    
We can verify the type of `df['email']` by calling the built-in function `type()` as follows:

In [9]:
type(df['email'])

pandas.core.series.Series

> A Series is a list of data, but with more functionality than a plain list.

A `DataFrame` is a set of rows and columns, whereas a `Series` holds the rows of a single column. Therefore, we can think of a `DataFrame` as a container of multiple `Series` objects.    
As we see from the result of `df['email']`, a Series carries an index alongside its values. 
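To make the "more than a list" idea concrete, here is a small sketch that builds the `email` column directly as a `Series` and uses a few of the extras a plain list does not have. The variable name `emails` is illustrative.

```python
import pandas as pd

# `emails` stands in for df['email'] from the cells above.
emails = pd.Series(
    ["johndoe@email.com", "maryjane@email.com", "joerogan@email.com"],
    name="email",
)

print(list(emails.index))  # [0, 1, 2]: every value has an index label
print(emails.name)         # email: the Series remembers its column name

# Vectorized string methods act on every element at once.
print(emails.str.upper().tolist())
print(emails.str.endswith("@email.com").all())  # True
```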

######  Access multiple columns in a DataFrame    
To access multiple columns of a `DataFrame`, pass a list of column names inside the brackets. Say we want the `last` and `email` columns of `df`; we do it as follows:

In [10]:
df[['last', 'email']]

Unnamed: 0,last,email
0,Doe,johndoe@email.com
1,Jane,maryjane@email.com
2,Rogan,joerogan@email.com


Notice that the output of `df[['last', 'email']]` is no longer a `Series`, since a `Series` is a single column. Instead we get a new `DataFrame` that contains only the selected columns and leaves out `first`.
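The difference between single and double brackets can be checked with `type()`. In this sketch the variable names `single`, `one_column_frame`, and `two_columns` are illustrative; note that a list with only one column name still produces a `DataFrame`.

```python
import pandas as pd

people = {
    "first": ["John", "Mary", "Joe"],
    "last": ["Doe", "Jane", "Rogan"],
    "email": ["johndoe@email.com", "maryjane@email.com", "joerogan@email.com"],
}
df = pd.DataFrame(people)

single = df["email"]              # a column name -> Series
one_column_frame = df[["email"]]  # a list of names -> DataFrame
two_columns = df[["last", "email"]]

print(type(single).__name__)            # Series
print(type(one_column_frame).__name__)  # DataFrame
print(one_column_frame.shape)           # (3, 1)
print(list(two_columns.columns))        # ['last', 'email']
```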

######  Access the list of column names    
We can get the column names of a DataFrame through its `columns` __attribute__, for example `df.columns`:

In [11]:
df.columns

Index(['first', 'last', 'email'], dtype='object')

### Access rows     
To access rows of a `DataFrame` there are two indexers, both used with square brackets rather than parentheses:    
- `iloc[]`, which accesses rows by integer location. It accepts:    
    - the integer position of a row, to retrieve a single row ----> `iloc[row_index]`.
    - a list of row positions, to retrieve several rows ----> `iloc[[row1, row2, ..., row_n]]`
    - an optional second argument with a column position, to retrieve the values of one column for the selected rows.  
        
        ----> `iloc[row_index, column_index]` __or__    
        
        ----> `iloc[[row1, row2, ..., row_n], column_index]`
    - an optional second argument with a list of column positions, to retrieve the selected rows with a specific set of columns.      
        
        ----> `iloc[row_index, [column_1, column_2, ..., column_n]]` __or__     
        
        ----> `iloc[[row1, row2, ..., row_n], [column_1, column_2, ..., column_n]]`    
- `loc[]`, which accesses rows and columns by label.    
    ----> `loc[row_label, column_label]`    
    ----> `loc[[row1, row2, ..., row_n], column_label]`    
    ----> `loc[[row1, row2, ..., row_n], [column_label_1, column_label_2, ..., column_label_n]]`    
    With the default index, the row labels are the integers 0, 1, 2, ..., so they coincide with the integer positions. There are scenarios where rows have other labels; we will discuss them in the next notebook.
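The forms listed above can be tried on the small `people` DataFrame. This is only a sketch; the variable names on the left-hand side are illustrative, and the comments state what kind of object each form returns.

```python
import pandas as pd

people = {
    "first": ["John", "Mary", "Joe"],
    "last": ["Doe", "Jane", "Rogan"],
    "email": ["johndoe@email.com", "maryjane@email.com", "joerogan@email.com"],
}
df = pd.DataFrame(people)

row = df.iloc[0]                    # one row -> Series
rows = df.iloc[[0, 2]]              # several rows -> DataFrame
cell = df.iloc[0, 2]                # one row, one column -> single value
column_part = df.iloc[[0, 1], 2]    # several rows, one column -> Series
block = df.iloc[[0, 1], [1, 2]]     # several rows and columns -> DataFrame

# The same selections by label (row labels are 0, 1, 2 by default).
same_cell = df.loc[0, "email"]
same_block = df.loc[[0, 1], ["last", "email"]]

print(cell)                      # johndoe@email.com
print(cell == same_cell)         # True
print(block.equals(same_block))  # True
```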

In the next cell, `df.iloc[0]` accesses the row at position 0. The result is a `Series` indexed by the column labels, holding the value of each column for that row.

In [12]:
df.iloc[0]

first                 John
last                   Doe
email    johndoe@email.com
Name: 0, dtype: object

In the next cell, `df.iloc[[0,1], 2]` accesses the rows at positions 0 and 1 of the column at position 2, which is `email`.

In [13]:
df.iloc[[0,1], 2]

0     johndoe@email.com
1    maryjane@email.com
Name: email, dtype: object

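In the next cell, we see how `df.loc` differs from `df.iloc`: `loc` lets us use labels instead of integer locations __"indexes"__. Here the column is selected by its label `'email'`, while the row labels `0` and `1` happen to match the row positions because __df__ uses the default index.    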

In [14]:
df.loc[[0,1], 'email']

0     johndoe@email.com
1    maryjane@email.com
Name: email, dtype: object
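With the default index, `loc` and `iloc` look almost identical for rows. The difference becomes visible once the row labels are not integers. The sketch below is a brief preview under an assumed custom index `["a", "b", "c"]`; the name `df_labeled` is illustrative.

```python
import pandas as pd

people = {
    "first": ["John", "Mary", "Joe"],
    "last": ["Doe", "Jane", "Rogan"],
    "email": ["johndoe@email.com", "maryjane@email.com", "joerogan@email.com"],
}

# `df_labeled` uses made-up row labels instead of the default 0, 1, 2.
df_labeled = pd.DataFrame(people, index=["a", "b", "c"])

# loc needs the row label; iloc needs the row position.
print(df_labeled.loc["b", "email"])  # maryjane@email.com
print(df_labeled.iloc[1, 2])         # maryjane@email.com

# The integer 1 is not a row label here, so loc rejects it.
try:
    df_labeled.loc[1, "email"]
except KeyError:
    print("1 is not a row label")
```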

Let's apply what we have learned so far to the Stack Overflow survey data we have.

In [15]:
survey_df = pd.read_csv('data/survey_results_public.csv')

In [16]:
survey_df

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
4,5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88878,88377,,Yes,Less than once a month but more than once per ...,The quality of OSS and closed source software ...,"Not employed, and not looking for work",Canada,No,Primary/elementary school,,...,,Tech articles written by other developers;Tech...,,Man,No,,,No,Appropriate in length,Easy
88879,88601,,No,Never,The quality of OSS and closed source software ...,,,,,,...,,,,,,,,,,
88880,88802,,No,Never,,Employed full-time,,,,,...,,,,,,,,,,
88881,88816,,No,Never,"OSS is, on average, of HIGHER quality than pro...","Independent contractor, freelancer, or self-em...",,,,,...,,,,,,,,,,


Let's read the `Hobbyist` column of `survey_df` as a `Series`:

In [17]:
survey_df['Hobbyist']

0        Yes
1         No
2        Yes
3         No
4        Yes
        ... 
88878    Yes
88879     No
88880     No
88881     No
88882    Yes
Name: Hobbyist, Length: 88883, dtype: object

To see the first respondent's answer to the `Hobbyist` question:

In [18]:
survey_df.loc[0, 'Hobbyist']

'Yes'

To count the frequency of responses in `survey_df['Hobbyist']` (how many __Yes__ and how many __No__), we can define our own counting function, use `Counter` from the standard-library `collections` module, or call the pandas `Series` method `value_counts()`.

In [19]:
def count_frequency(words_list):
    frequency = {}
    for word in words_list:
        if word in frequency:
            frequency[word] = frequency[word] + 1
        else:
            frequency[word] = 1
    return frequency

In [20]:
count_frequency(survey_df['Hobbyist'])

{'Yes': 71257, 'No': 17626}

__OR__

In [21]:
survey_df['Hobbyist'].value_counts()

Yes    71257
No     17626
Name: Hobbyist, dtype: int64
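The standard library offers a third way to get the same counts: `collections.Counter`. The sketch below uses a short made-up list called `answers` in place of the survey column, and also shows the `normalize=True` option of `value_counts()`, which returns shares instead of counts.

```python
from collections import Counter

import pandas as pd

# `answers` is a made-up sample, not the real survey data.
answers = ["Yes", "No", "Yes", "Yes"]

# Counter builds the same kind of frequency dictionary as count_frequency.
counts = Counter(answers)
print(counts["Yes"], counts["No"])  # 3 1

# value_counts() gives the same numbers, sorted from most to least frequent.
series_counts = pd.Series(answers).value_counts()
print(series_counts.to_dict())

# normalize=True turns the counts into fractions of the total.
shares = pd.Series(answers).value_counts(normalize=True)
print(shares["Yes"])  # 0.75
```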

With `loc` and `iloc` we can use __slicing__, much like __list slicing__ in __standard Python__. The difference is that label-based slicing with `loc` is __end-inclusive__: the slice runs from the start label up to and including the end label. Slicing with `iloc` follows the standard Python rule and excludes the end position.    
__Note__: when slicing, do not wrap the slice expression in list square brackets.     
__Example__: imagine we want to display the first three rows of the `Hobbyist` column:

In [22]:
survey_df.loc[0:2, 'Hobbyist']

0    Yes
1     No
2    Yes
Name: Hobbyist, dtype: object
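The end-point behavior is easy to check on the small `people` DataFrame. In this sketch the same slice `0:2` is used with both indexers; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

people = {
    "first": ["John", "Mary", "Joe"],
    "last": ["Doe", "Jane", "Rogan"],
    "email": ["johndoe@email.com", "maryjane@email.com", "joerogan@email.com"],
}
df = pd.DataFrame(people)

# loc slices by label and includes the end label 2.
by_label = df.loc[0:2, "email"]

# iloc slices by position and stops before position 2, like a Python list.
by_position = df.iloc[0:2, 2]

print(len(by_label))     # 3
print(len(by_position))  # 2
```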

Slicing in pandas is not limited to integer indexes; we can slice by labels too.     
__Example__: imagine we want to show the answers of the first three respondents for the columns from `Hobbyist` to `Employment`.   

Let's first show all columns to have a look.

In [23]:
survey_df.columns

Index(['Respondent', 'MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource',
       'Employment', 'Country', 'Student', 'EdLevel', 'UndergradMajor',
       'EduOther', 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode',
       'YearsCodePro', 'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney',
       'MgrWant', 'JobSeek', 'LastHireDate', 'LastInt', 'FizzBuzz',
       'JobFactors', 'ResumeUpdate', 'CurrencySymbol', 'CurrencyDesc',
       'CompTotal', 'CompFreq', 'ConvertedComp', 'WorkWeekHrs', 'WorkPlan',
       'WorkChallenge', 'WorkRemote', 'WorkLoc', 'ImpSyn', 'CodeRev',
       'CodeRevHrs', 'UnitTests', 'PurchaseHow', 'PurchaseWhat',
       'LanguageWorkedWith', 'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'WebFrameWorkedWith',
       'WebFrameDesireNextYear', 'MiscTechWorkedWith',
       'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',
       'BlockchainOrg', 'BlockchainIs', 'BetterLife'

From the column list, the range of columns from `Hobbyist` to `Employment` is `['Hobbyist', 'OpenSourcer', 'OpenSource', 'Employment']`.   

Our goal is to show the first three rows with this range of columns:

In [24]:
survey_df.loc[0:2, 'Hobbyist': 'Employment']

Unnamed: 0,Hobbyist,OpenSourcer,OpenSource,Employment
0,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work"
1,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work"
2,Yes,Never,The quality of OSS and closed source software ...,Employed full-time
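The same kind of column slice can be reproduced on the small `people` DataFrame without the survey file. This sketch compares a label slice with the equivalent position slice; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

people = {
    "first": ["John", "Mary", "Joe"],
    "last": ["Doe", "Jane", "Rogan"],
    "email": ["johndoe@email.com", "maryjane@email.com", "joerogan@email.com"],
}
df = pd.DataFrame(people)

# Label slices include both ends: rows 0..1 and columns first..last.
by_label = df.loc[0:1, "first":"last"]

# Position slices exclude the end, so 0:2 is needed to get two items.
by_position = df.iloc[0:2, 0:2]

print(list(by_label.columns))        # ['first', 'last']
print(by_label.equals(by_position))  # True
```

A column label slice follows the order of the columns in the DataFrame, not alphabetical order, which is why it helps to inspect `df.columns` first.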
