### Lesson 2: DataFrame fundamentals and indexing

### Part 2.2.1: Crafting DataFrames
- The dataframe is one of the data structures most widely used by data scientists.
- When we read a data file (CSV, Excel, etc.) using the pandas `read_csv` or `read_excel` functions, the result is a dataframe.
- We can perform various operations on rows and columns of the dataframe like indexing, subsetting, grouping etc.
- Each column in a dataframe is called a Series.
- Let us now explore the fundamentals of the dataframe.

#### We can create a dataframe from a combination of data structures such as lists, dictionaries and Series. So, let us first understand lists, dictionaries and Series


###### A list is a data structure in Python that holds an ordered sequence of elements, which can be of different types

In [1]:
# create a simple list
ls = [1,2,3.02,'x', 10]

In [2]:
# print the list
print(ls)

[1, 2, 3.02, 'x', 10]


In [3]:
#ls is of type list
print(type(ls))  

<class 'list'>


In [4]:
# reading data from the keyboard and create a list

ls=[]   # create empty list

In [5]:
print(ls)

[]


In [6]:
# we will use the input() function to read the data from keyboard
# we will use a for loop to read multiple elements


for i in range(3):           # the loop runs 3 times; range(3) yields 0, 1, 2 because the default start is 0
    ls.append(input('Enter a value:'))    # input() reads one line from the keyboard and returns it as a string

Enter a value:1
Enter a value:2
Enter a value:3


In [7]:
# print the contents of the list    
print(ls)

['1', '2', '3']
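
Note that `input()` always returns a string, which is why the list above holds `'1'`, `'2'`, `'3'` rather than numbers. A minimal sketch of converting such values to integers is shown below; `raw_values` is a hypothetical stand-in for the values typed at the keyboard.

```python
# stand-in for values read with input(); each element is a string
raw_values = ['1', '2', '3']

# convert each string to an integer with a list comprehension
numbers = [int(v) for v in raw_values]

print(numbers)        # [1, 2, 3]
print(sum(numbers))   # 6
```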


**NOTE:**
- The list is one of the most widely used data structures in Python.
- There are several built-in methods for operating on lists, such as append(), extend(), copy(), count(), insert(), pop(), sort() etc. We will use them as and when required.
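
As a quick illustration of a few of the list methods named above, here is a small sketch. The list `nums` and its values are made up for this example.

```python
nums = [3, 1, 2]

nums.append(5)          # add one element at the end -> [3, 1, 2, 5]
nums.extend([4, 0])     # add several elements -> [3, 1, 2, 5, 4, 0]
nums.insert(0, 9)       # insert 9 at position 0 -> [9, 3, 1, 2, 5, 4, 0]
last = nums.pop()       # remove and return the last element (0)
nums.sort()             # sort in place -> [1, 2, 3, 4, 5, 9]

print(nums)             # [1, 2, 3, 4, 5, 9]
print(nums.count(3))    # 1
```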

###### A dictionary is a data structure in Python which is a collection of key:value pairs, where each value is looked up by its key

In [8]:
# create a dictionary of name:age pairs
d = {'Mike':25, 'Jack':23, 'Tom':20}
print(d)

{'Mike': 25, 'Jack': 23, 'Tom': 20}


In [9]:
# adding a new pair to the existing dictionary
d['Jerry']=21
print(d)

{'Mike': 25, 'Jack': 23, 'Tom': 20, 'Jerry': 21}


In [10]:
# use items() method to display dictionary contents
print(list(d.items()))

[('Mike', 25), ('Jack', 23), ('Tom', 20), ('Jerry', 21)]


**NOTE:**
- Each record/row in a dataframe can be thought of as a dictionary, where the keys are the column headers and the values are the elements in the row.
- So, understanding the basics of dictionaries is helpful when you create a dataframe.
- A dictionary object supports a few methods like items(), keys(), values() etc. to extract the elements within the dictionary. We will use them as and when required.
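
The sketch below shows `keys()` and `values()` on the dictionary from earlier, and then illustrates the row-as-dictionary idea by building a small dataframe from a list of dictionaries. The `rows` data and the names `people` and `first_row` are invented for this example.

```python
import pandas as pd

d = {'Mike': 25, 'Jack': 23, 'Tom': 20}

print(list(d.keys()))     # ['Mike', 'Jack', 'Tom']
print(list(d.values()))   # [25, 23, 20]

# each dictionary is one row; its keys become the column headers
rows = [{'name': 'Mike', 'age': 25},
        {'name': 'Jack', 'age': 23}]
people = pd.DataFrame(rows)

# a single row can be turned back into a dictionary
first_row = people.iloc[0].to_dict()
print(first_row)
```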

###### A Series is a data structure in pandas: a one-dimensional labeled array, like a single column of values

In [11]:
import pandas as pd

In [12]:
# create a series of elements
s = pd.Series([10,20,30,40])
print(s)

0    10
1    20
2    30
3    40
dtype: int64


In [13]:
# access the elements of a series
print(s[2])

30


In [14]:
# give a user-defined index to series elements
s1 = pd.Series([10,20,30,40],index=['first', 'second','third','fourth'])
print(s1)

first     10
second    20
third     30
fourth    40
dtype: int64


In [15]:
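# access using either the user-defined label or the integer position
# (iloc is used for the position, because plain s1[1] on a label-indexed series is deprecated in recent pandas)
print(s1['second'])
print(s1.iloc[1])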

20
20


**NOTE:**
- Each column in a dataframe is a Series.
- So, whenever we want to operate on only one column of the dataframe, we use the built-in operations meant for Series.
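
A minimal sketch of a few built-in Series operations of the kind that apply to a single dataframe column. The Series `prices` reuses the values from the example above; the variable names are chosen only for this illustration.

```python
import pandas as pd

prices = pd.Series([10, 20, 30, 40],
                   index=['first', 'second', 'third', 'fourth'])

print(prices.mean())      # 25.0
print(prices.max())       # 40
print(prices.idxmax())    # fourth (the label of the largest value)

doubled = prices * 2      # element-wise arithmetic returns a new Series
print(doubled['third'])   # 60
```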

#### Creating DataFrame using Lists, Dictionaries and Series

In [16]:
# create a list of indices
index_list=['Company A','Company B','Company C','Company D','Company E','Company F']

In [17]:
# create a dictionary, where each key is a column name and each value is a series
# pass index_list as the index of every series so that the values line up by company

company_dir = {'Closing price': pd.Series([346.15,0.59,459,0.52,589.8,158.88], 
                                        index=index_list),
             
                'EPS': pd.Series([1133.43,36.05,145.02, 4.5, 31.44,380.64],
                                 index=index_list),
                
                'Beta': pd.Series([1,2,3,4,5,6],index=index_list),
                'P/E': pd.Series([10,20,30,40,60,50], index=index_list),
                'Market Cap(B)': pd.Series([1254.05, 43.2, 2300, 5.6, 773.8, 521.56], index=index_list)
                
               }

In [18]:
company_dir

{'Closing price': Company A    346.15
 Company B      0.59
 Company C    459.00
 Company D      0.52
 Company E    589.80
 Company F    158.88
 dtype: float64,
 'EPS': Company A    1133.43
 Company B      36.05
 Company C     145.02
 Company D       4.50
 Company E      31.44
 Company F     380.64
 dtype: float64,
 'Beta': Company A    1
 Company B    2
 Company C    3
 Company D    4
 Company E    5
 Company F    6
 dtype: int64,
 'P/E': Company A    10
 Company B    20
 Company C    30
 Company D    40
 Company E    60
 Company F    50
 dtype: int64,
 'Market Cap(B)': Company A    1254.05
 Company B      43.20
 Company C    2300.00
 Company D       5.60
 Company E     773.80
 Company F     521.56
 dtype: float64}

In [19]:
companydf = pd.DataFrame(company_dir)
companydf

| | Closing price | EPS | Beta | P/E | Market Cap(B) |
|---|---|---|---|---|---|
| Company A | 346.15 | 1133.43 | 1 | 10 | 1254.05 |
| Company B | 0.59 | 36.05 | 2 | 20 | 43.2 |
| Company C | 459.0 | 145.02 | 3 | 30 | 2300.0 |
| Company D | 0.52 | 4.5 | 4 | 40 | 5.6 |
| Company E | 589.8 | 31.44 | 5 | 60 | 773.8 |
| Company F | 158.88 | 380.64 | 6 | 50 | 521.56 |


**NOTE:**
- We created a new dataframe from several data structures here.
- There are multiple other ways to do so, but we will not explore those, because most of the time we will be using business data that is already saved in Excel or CSV format.
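
For reference only, one such alternative is sketched below: a dictionary of plain lists, with the row labels passed once through the `index` argument. The three-company subset and the name `small_df` are chosen just for this illustration.

```python
import pandas as pd

index_list = ['Company A', 'Company B', 'Company C']

# plain lists instead of Series; the row labels are supplied once via `index`
small_df = pd.DataFrame({'Closing price': [346.15, 0.59, 459.0],
                         'Beta': [1, 2, 3]},
                        index=index_list)

print(small_df.shape)   # (3, 2)
```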

### Part 2.2.2: Slicing with precision
- Slicing is a way of extracting a subset of the dataframe.
- We can extract the required rows and columns by using appropriate indices for slicing.
- Two indexers, ***loc*** and ***iloc***, can be used to create subsets of the data; they are used with square brackets rather than called like functions.
- We will use the ***companydf*** dataframe created in the previous video to demonstrate these concepts

In [20]:
#let us view the data frame
companydf

| | Closing price | EPS | Beta | P/E | Market Cap(B) |
|---|---|---|---|---|---|
| Company A | 346.15 | 1133.43 | 1 | 10 | 1254.05 |
| Company B | 0.59 | 36.05 | 2 | 20 | 43.2 |
| Company C | 459.0 | 145.02 | 3 | 30 | 2300.0 |
| Company D | 0.52 | 4.5 | 4 | 40 | 5.6 |
| Company E | 589.8 | 31.44 | 5 | 60 | 773.8 |
| Company F | 158.88 | 380.64 | 6 | 50 | 521.56 |


In [21]:
#extract only few rows by specifying starting and ending row indices
#the first index in the slicing is included and the last index is not

companydf[1:4]    

| | Closing price | EPS | Beta | P/E | Market Cap(B) |
|---|---|---|---|---|---|
| Company B | 0.59 | 36.05 | 2 | 20 | 43.2 |
| Company C | 459.0 | 145.02 | 3 | 30 | 2300.0 |
| Company D | 0.52 | 4.5 | 4 | 40 | 5.6 |


In [22]:
#negative positions count from the end of the dataframe: -1 is the last row, -2 the one before it, and so on

companydf[-3:]    #extract rows from the index -3 till the end of the dataframe

| | Closing price | EPS | Beta | P/E | Market Cap(B) |
|---|---|---|---|---|---|
| Company D | 0.52 | 4.5 | 4 | 40 | 5.6 |
| Company E | 589.8 | 31.44 | 5 | 60 | 773.8 |
| Company F | 158.88 | 380.64 | 6 | 50 | 521.56 |


In [23]:
#extract only a particular column 
companydf['EPS']    #observe that the result is a series 

Company A    1133.43
Company B      36.05
Company C     145.02
Company D       4.50
Company E      31.44
Company F     380.64
Name: EPS, dtype: float64

In [24]:
#extract one column with required rows
companydf['Beta'][1:4]     #column is extracted first, and from that series, required rows are sliced

Company B    2
Company C    3
Company D    4
Name: Beta, dtype: int64

#### Using the `loc` and `iloc` indexers
- `loc` selects rows and columns by label (index labels for rows, header names for columns).
- `iloc` selects rows and columns by integer position, so it takes integers rather than labels.
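
The difference can be sketched on a small stand-in dataframe. The name `df` and its three rows are chosen only for this illustration. Note that a label slice with `loc` includes the end label, while a position slice with `iloc` excludes the end position.

```python
import pandas as pd

df = pd.DataFrame({'EPS': [1133.43, 36.05, 145.02],
                   'Beta': [1, 2, 3]},
                  index=['Company A', 'Company B', 'Company C'])

by_label = df.loc['Company A':'Company B', 'EPS']   # label slice: end label included
by_position = df.iloc[0:2, 0]                        # position slice: end position excluded

print(by_label.equals(by_position))   # True
print(df.loc['Company C', 'Beta'])    # 3
print(df.iloc[2, 1])                  # 3
```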

In [25]:
# With loc we must give the actual row labels, which are strings in this dataframe
companydf.loc['Company B': 'Company D']       #note that both starting and ending indices are inclusive

| | Closing price | EPS | Beta | P/E | Market Cap(B) |
|---|---|---|---|---|---|
| Company B | 0.59 | 36.05 | 2 | 20 | 43.2 |
| Company C | 459.0 | 145.02 | 3 | 30 | 2300.0 |
| Company D | 0.52 | 4.5 | 4 | 40 | 5.6 |


In [26]:
# observe the list of column names, passed as the second argument to loc, to extract only the required columns
companydf.loc['Company B':'Company D', ['EPS', 'Beta']]

| | EPS | Beta |
|---|---|---|
| Company B | 36.05 | 2 |
| Company C | 145.02 | 3 |
| Company D | 4.5 | 4 |


In [27]:
#following code generates error, because the row indices of the data frame are not numbers
companydf.loc[:3]    

TypeError: cannot do slice indexing on Index with these indexers [3] of type int

In [28]:
# when you want to extract rows and columns by their integer positions, use iloc
# both rows and columns are given as positions; the end of each slice is excluded
companydf.iloc[1:3,0:2]   #rows at positions 1 and 2, columns at positions 0 and 1 are displayed


| | Closing price | EPS |
|---|---|---|
| Company B | 0.59 | 36.05 |
| Company C | 459.0 | 145.02 |


In [29]:
#we can use lists to select any rows/columns, not necessarily in sequence
companydf.iloc[:,[1,4,3]]  #all rows of columns 1, 4 and 3 are displayed, in the order listed

| | EPS | Market Cap(B) | P/E |
|---|---|---|---|
| Company A | 1133.43 | 1254.05 | 10 |
| Company B | 36.05 | 43.2 | 20 |
| Company C | 145.02 | 2300.0 | 30 |
| Company D | 4.5 | 5.6 | 40 |
| Company E | 31.44 | 773.8 | 60 |
| Company F | 380.64 | 521.56 | 50 |


In [30]:
#rows 0 and 2 of columns 1 and 3 are displayed
companydf.iloc[[0,2],[1,3]]   

| | EPS | P/E |
|---|---|---|
| Company A | 1133.43 | 10 |
| Company C | 145.02 | 30 |


### Part 2.2.3  : Changing the indices and saving the new dataframe

- We will look at how to turn an index column into one of the attributes in the dataframe, and vice-versa
- Also, we will see how to change the name of a column
- For demonstrating these, we will use **companydf** dataframe created earlier.
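
As a side note, `reset_index()`, `rename()` and `set_index()` can also be used without `inplace=True`, in which case each call returns a new dataframe and leaves the original untouched. A minimal sketch on a two-row stand-in dataframe follows; the names `df`, `renamed` and `restored` are chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'EPS': [1133.43, 36.05]},
                  index=['Company A', 'Company B'])

# each call returns a new DataFrame; `df` itself is left unchanged
renamed = (df.reset_index()
             .rename(columns={'index': 'Company names'}))
print(list(renamed.columns))    # ['Company names', 'EPS']

# turn the renamed column back into the index
restored = renamed.set_index('Company names')
print(list(restored.index))     # ['Company A', 'Company B']
print(df.index.name)            # None
```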

In [31]:
# view the dataframe
companydf

| | Closing price | EPS | Beta | P/E | Market Cap(B) |
|---|---|---|---|---|---|
| Company A | 346.15 | 1133.43 | 1 | 10 | 1254.05 |
| Company B | 0.59 | 36.05 | 2 | 20 | 43.2 |
| Company C | 459.0 | 145.02 | 3 | 30 | 2300.0 |
| Company D | 0.52 | 4.5 | 4 | 40 | 5.6 |
| Company E | 589.8 | 31.44 | 5 | 60 | 773.8 |
| Company F | 158.88 | 380.64 | 6 | 50 | 521.56 |


In [32]:
# use the reset_index() method to make the existing indices a column of the data frame
companydf.reset_index(inplace = True)
companydf

| | index | Closing price | EPS | Beta | P/E | Market Cap(B) |
|---|---|---|---|---|---|---|
| 0 | Company A | 346.15 | 1133.43 | 1 | 10 | 1254.05 |
| 1 | Company B | 0.59 | 36.05 | 2 | 20 | 43.2 |
| 2 | Company C | 459.0 | 145.02 | 3 | 30 | 2300.0 |
| 3 | Company D | 0.52 | 4.5 | 4 | 40 | 5.6 |
| 4 | Company E | 589.8 | 31.44 | 5 | 60 | 773.8 |
| 5 | Company F | 158.88 | 380.64 | 6 | 50 | 521.56 |


In [33]:
# give a new name to this column
companydf.rename(columns={'index':'Company names'},inplace=True)
companydf

| | Company names | Closing price | EPS | Beta | P/E | Market Cap(B) |
|---|---|---|---|---|---|---|
| 0 | Company A | 346.15 | 1133.43 | 1 | 10 | 1254.05 |
| 1 | Company B | 0.59 | 36.05 | 2 | 20 | 43.2 |
| 2 | Company C | 459.0 | 145.02 | 3 | 30 | 2300.0 |
| 3 | Company D | 0.52 | 4.5 | 4 | 40 | 5.6 |
| 4 | Company E | 589.8 | 31.44 | 5 | 60 | 773.8 |
| 5 | Company F | 158.88 | 380.64 | 6 | 50 | 521.56 |


In [34]:
# choose a column and make it the index of the dataframe
companydf=companydf.set_index('Company names')
companydf

| Company names | Closing price | EPS | Beta | P/E | Market Cap(B) |
|---|---|---|---|---|---|
| Company A | 346.15 | 1133.43 | 1 | 10 | 1254.05 |
| Company B | 0.59 | 36.05 | 2 | 20 | 43.2 |
| Company C | 459.0 | 145.02 | 3 | 30 | 2300.0 |
| Company D | 0.52 | 4.5 | 4 | 40 | 5.6 |
| Company E | 589.8 | 31.44 | 5 | 60 | 773.8 |
| Company F | 158.88 | 380.64 | 6 | 50 | 521.56 |


##### Save the dataframe into a file

In [35]:
# save the data to a CSV
companydf.to_csv('CompanyData.csv')
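
When a CSV saved this way is read back, the index comes back as an ordinary column unless we say otherwise. A minimal sketch, assuming a two-row stand-in dataframe and an in-memory buffer in place of a file on disk (`df`, `buffer` and `loaded` are names chosen for this example):

```python
import io
import pandas as pd

df = pd.DataFrame({'EPS': [1133.43, 36.05]},
                  index=pd.Index(['Company A', 'Company B'],
                                 name='Company names'))

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# index_col tells pandas which column to use as the row labels
loaded = pd.read_csv(buffer, index_col='Company names')
print(loaded.index.name)      # Company names
print(list(loaded.columns))   # ['EPS']
```

With a real file, the same `index_col` argument can be passed to `pd.read_csv` along with the file name.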