___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" 
alt="CLRSWY"></p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:100%; text-align:center; border-radius:10px 10px;">WAY TO REINVENT YOURSELF</p>

<img src=https://i.ibb.co/6gCsHd6/1200px-Pandas-logo-svg.png width="700" height="200">

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:200%; text-align:center; border-radius:10px 10px;">Data Analysis with Python</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:150%; text-align:center; border-radius:10px 10px;">Lab-01 Session</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#4d77cf; font-size:200%; text-align:center; border-radius:10px 10px;">Playing with Pandas Series & DataFrames</p>

<a id="toc"></a>

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Content</p>

* [OVERVIEW](#0)
* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#1)
* [CREATING A PANDAS SERIES](#2)
* [WORKING WITH SERIES DATA STRUCTURE](#3)
* [CREATING A PANDAS DATAFRAME](#4)
* [WORKING WITH DATAFRAMES](#5)
* [INDEXING, SLICING & SELECTION](#6)
* [THE END OF THE LAB-01 SESSION](#7)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Overview</p>

<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

## What is Pandas in Python?

[**Pandas**](http://pandas.pydata.org/) is one of the most widely used Python libraries. It provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way towards this goal.

Data held in Pandas objects is commonly used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.

Its popularity has surged in recent years, coinciding with the rise of fields such as data science and machine learning. Here’s a popularity comparison over time against STATA, SAS, and [dplyr](https://dplyr.tidyverse.org/), courtesy of Stack Overflow Trends.

<img src="https://i.ibb.co/crf3ksp/pandas-vs-rest.png" style="">

## Core Components of Pandas Data Structure

Organizing data in a particular way is known as a data structure. **``Pandas``** has **two core data structures**, and all operations are based on these two objects:

  - [**Series :**](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) A one-dimensional labeled array whose values share a single data type.
  - [**DataFrame :**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) A two-dimensional labeled data structure, like a table with rows and columns, whose columns can have different types.
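As a minimal sketch of the two structures described above, the example below builds one of each from made-up data (the names `prices` and `table` are illustrative only) and shows that selecting a single column of a DataFrame gives back a Series.

```python
import pandas as pd

# A Series: one-dimensional, labeled values (illustrative data).
prices = pd.Series([1.5, 2.0, 3.25], index=["a", "b", "c"], name="price")

# A DataFrame: a table whose columns may have different dtypes.
table = pd.DataFrame({"item": ["pen", "ink", "pad"],
                      "price": [1.5, 2.0, 3.25]})

# Selecting a single column of a DataFrame returns a Series.
column = table["price"]

print(type(prices).__name__, prices.ndim)   # Series 1
print(type(table).__name__, table.ndim)     # DataFrame 2
print(type(column).__name__)                # Series
```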

## Main Features

Just as [**NumPy**](http://www.numpy.org/) provides the basic array data type plus core array operations, **``Pandas``**:

1. defines fundamental structures for working with data and  
1. endows them with methods that facilitate operations such as  
  
  - reading in data  
  - adjusting indices  
  - working with dates and time series  
  - sorting, grouping, re-ordering and general data munging <sup><a href=#mung id=mung-link>[1]</a></sup>  
  - dealing with missing values, etc., etc.  
  
Here are just a few of the things that pandas does well:

  - Easy handling of [missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) (represented as **``NaN``**) in floating point as well as non-floating point data
  - Size mutability: columns can be [inserted and deleted](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html) from DataFrame and higher dimensional objects
  - Automatic and explicit [data alignment](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html): objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let **``Series``**, **``DataFrame``**, etc. automatically align the data for you in computations
  - Powerful, flexible [group by](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  - Make it [easy to convert](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html) ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  - Intelligent label-based [slicing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html), [fancy indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html), and [subsetting](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) of large data sets
  - Intuitive [merging](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) and [joining](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) datasets
  - Flexible [reshaping](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html) and [pivoting](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html) of datasets
  - [Hierarchical labeling](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) of axes (possible to have multiple labels per tick)
  - Robust IO tools for loading data from [flat files](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) (CSV and delimited), [Excel files](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), [databases](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), and saving/loading data from the ultrafast [HDF5 format](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)
  - [Time series](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
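A few of the features listed above can be sketched on tiny, made-up inputs. The example below is only an illustration (the Series `a`, `b` and the `sales` table are assumptions, not part of the lab data): it shows automatic label alignment, how a missing value appears as `NaN`, and a simple split-apply-combine with `groupby`.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on labels; "x" has no partner in b, so it becomes NaN.
total = a + b
print(total)

# Reductions skip missing values by default.
print(total.sum())  # 35.0

# Split-apply-combine: group rows by a key, then aggregate each group.
sales = pd.DataFrame({"region": ["EU", "EU", "NA"], "units": [3, 4, 5]})
per_region = sales.groupby("region")["units"].sum()
print(per_region)
```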

More sophisticated statistical functionality is left to other packages, such as [statsmodels](http://www.statsmodels.org/) and [scikit-learn](http://scikit-learn.org/), which work well with pandas objects.

This session will provide a basic introduction to Pandas. Throughout the session, we will assume that the following imports have taken place.

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Importing Libraries Needed in This Notebook</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Once you've installed NumPy and Pandas, you can import them as follows:

In [79]:
import numpy as np
import pandas as pd

pd.options.display.float_format = '{:20,.2f}'.format  # Display floats with 2 decimals and thousands separators instead of scientific notation
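To see what this display option does, here is a small sketch using the same format specification on made-up numbers. It uses `pd.option_context` so that the setting applies only inside the `with` block; the variable names are illustrative.

```python
import pandas as pd

# Same format spec as above: width 20, thousands separator, 2 decimals.
fmt = '{:20,.2f}'.format
text = fmt(1234567.891)
print(repr(text))

values = pd.Series([0.000012345, 1234567.891])

# option_context applies the display option only inside the with-block,
# so this sketch does not alter the notebook-wide setting.
with pd.option_context("display.float_format", fmt):
    shown = repr(values)
print(shown)

# Only the display changes; the stored numbers keep their full precision.
print(values.iloc[1])
```

Because the option affects formatting only, calculations on the Series still use the unrounded values.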

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Creating a Pandas Series</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

**``Pandas Series``** is a **one-dimensional** data structure. It can hold data of many types, including **``objects``**, **``floats``**, **``strings``** and **``integers``**. You can create a Series by calling **``pandas.Series()``**. A **``list``**, **``numpy array``**, or **``dict``** can be turned into a **``Pandas Series``**. You should use the simplest data structure that meets your needs [Source](https://pythonbasics.org/pandas-series/). The **``axis labels``** are collectively called the **``index``**. **``Labels``** need not be unique but must be of a [**hashable type**](https://stackoverflow.com/questions/14535730/what-does-hashable-mean-in-python#:~:text=In%20Python%2C%20any%20immutable%20object,sets%20to%20track%20unique%20values.). The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index [Source](https://www.geeksforgeeks.org/creating-a-pandas-series/).

**``Series``** and **``DataFrame``** are two important data types defined by Pandas.

You can think of a Series as a “column” of data, such as a collection of observations on a single variable.

A DataFrame is an object for storing related columns of data or combination of Series.

**``A Series``** is a **one-dimensional data structure** that is **homogeneous**: all of its values share a single dtype and are implicitly labelled with an index. For example, we can have a Series of integers, real numbers, characters, or strings; a Series that mixes different kinds of Python objects is stored with the generic `object` dtype. We can conveniently manipulate these series by performing operations like adding, deleting, ordering, joining, filtering, vectorized operations, statistical analysis, plotting, etc. 

**``A Series``** is very **similar to a NumPy array** (in fact it is built on top of the NumPy array object). **What differentiates** the NumPy array from a Series, is that a Series can **have axis labels**, meaning it can be indexed by a label, instead of just a number location. It also doesn’t need to hold numeric data, it can hold any arbitrary Python Object [Source](http://www.datasciencelovers.com/python-for-data-science/pandas-series/).

So the important points to remember about a Pandas Series are:

- Homogeneous data
- Size Immutable
- Values of Data Mutable

**You can create a Series by calling** **``pandas.Series()``** . A **``list``**, **``numpy array``**, **``dict``** can be turned into a Pandas Series.

**``pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)``**
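As a quick sketch of the constructor above, the following builds a Series from each of the three input types mentioned (a list, a NumPy array, and a dict). The data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a list: a default integer index 0, 1, 2 is created.
from_list = pd.Series([10, 20, 30])

# From a NumPy array, with explicit labels passed through `index`.
from_array = pd.Series(np.array([1.5, 2.5]), index=["a", "b"])

# From a dict: the keys become the index labels.
from_dict = pd.Series({"x": 1, "y": 2})

print(from_list.index.tolist())   # [0, 1, 2]
print(from_array["b"])            # 2.5
print(from_dict.index.tolist())   # ['x', 'y']
```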

In [80]:
# First, let's create a Python Pandas Series of characters:

s = pd.Series(['a', 'b', 'c', 'd', 'e'])
s

0    a
1    b
2    c
3    d
4    e
dtype: object

In [135]:
pd.Series(('a', 'b', 'c', 'd', 'e'))

0    a
1    b
2    c
3    d
4    e
dtype: object

In [81]:
# Let's create a random Python Pandas Series of float numbers:

s = pd.Series(np.random.randn(4), name='Daily Returns')
s

0                  -0.46
1                  -0.94
2                   1.05
3                   1.26
Name: Daily Returns, dtype: float64

In [82]:
s.name

'Daily Returns'

We have allowed the index label to appear by default. It starts at 0, and we can check the index as:

In [83]:
s.index

RangeIndex(start=0, stop=4, step=1)

Here you can think of the indices `0, 1, 2, 3` as four listed companies, and the values as the daily returns on their shares.

One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.). 

**``Pandas Series``** **are built on top of** **``NumPy arrays``** **and support many** **``similar operations:``**

In [84]:
# For contrast: a plain Python list is NOT element-wise, * repeats the list
[1, 2, 3, 4] * 3

[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]

In mathematics, element-wise operations are operations applied to the individual elements of a matrix. Any arithmetic operation on arrays is applied element-wise. **[NumPy Basics: Arrays and Vectorized Computation](https://www.oreilly.com/library/view/python-for-data/9781449323592/ch04.html)** & **[Numerical Operations on Arrays](https://scipy-lectures.org/intro/numpy/operations.html)**

In [85]:
s * 100

0                 -46.07
1                 -93.86
2                 105.24
3                 126.50
Name: Daily Returns, dtype: float64

In [86]:
np.abs(s)

0                   0.46
1                   0.94
2                   1.05
3                   1.26
Name: Daily Returns, dtype: float64

But `Series` provide more than NumPy arrays.

Not only do they have some additional (statistically oriented) methods

In [87]:
s.describe()

count                   4.00
mean                    0.23
std                     1.09
min                    -0.94
25%                    -0.58
50%                     0.30
75%                     1.11
max                     1.26
Name: Daily Returns, dtype: float64

In [88]:
s.std()

1.0939616138788835

But their indices are more flexible and we can specify the index we need:

In [89]:
s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']
s

AMZN                  -0.46
AAPL                  -0.94
MSFT                   1.05
GOOG                   1.26
Name: Daily Returns, dtype: float64

Viewed in this way, `Series` are like fast, efficient Python dictionaries (with the restriction that the items in the dictionary all have the same type—in this case, floats).

In fact, you can use much of the same syntax as Python dictionaries

In [90]:
s['AMZN']

-0.46067834411257047

It is also possible to assign a new value to a label (compare this with **[Broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html#general-broadcasting-rules)**):

In [91]:
s['AMZN'] = 999
s

AMZN                 999.00
AAPL                  -0.94
MSFT                   1.05
GOOG                   1.26
Name: Daily Returns, dtype: float64

**[Vectorization in Python](https://www.askpython.com/python-modules/numpy/vectorization-numpy):** Vectorization is a technique of implementing array operations without using for loops.<br>
**[A Gentle Introduction to Broadcasting with NumPy Arrays](https://machinelearningmastery.com/broadcasting-with-numpy-arrays/):** Broadcasting is the name given to the method that NumPy uses to allow array arithmetic between arrays with a different shape or size.<br>
**[Broadcasting in NumPy](https://towardsdatascience.com/broadcasting-in-numpy-58856f926d73):** Broadcasting is an operation of matching the dimensions of differently shaped arrays in order to be able to perform further operations on those arrays (eg per-element arithmetic).
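
The two ideas defined above can be sketched with a few lines of NumPy. The arrays here are made up for illustration: the first part compares an explicit loop with the vectorized form, and the second part broadcasts a column of shape `(3, 1)` against a row of shape `(4,)`.

```python
import numpy as np

values = np.array([1, 2, 3, 4])

# Loop version: multiply each element one at a time.
looped = np.array([v * 3 for v in values])

# Vectorized version: no explicit loop; the scalar 3 is broadcast
# across every element of the array.
vectorized = values * 3
print(vectorized)  # [ 3  6  9 12]

# Broadcasting between different shapes: (3, 1) + (4,) -> (3, 4).
col = np.array([[0], [10], [20]])
grid = col + values
print(grid.shape)  # (3, 4)
print(grid)
```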

In [92]:
'AAPL' in s

True
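
The dictionary analogy can be pushed a little further. The sketch below uses a small, made-up Series (not the random `s` above) to show that membership tests look at the index labels rather than the values, and that the dict-style helpers `get`, `keys`, and `items` are available.

```python
import pandas as pd

returns = pd.Series({"AMZN": -0.46, "AAPL": -0.94, "MSFT": 1.05})

# `in` checks the index labels (like dict keys), not the values.
print("AAPL" in returns)   # True
print(-0.94 in returns)    # False

# get() returns a default instead of raising KeyError for a missing label.
print(returns.get("TSLA", 0.0))  # 0.0

# keys() returns the index; items() yields (label, value) pairs.
print(list(returns.keys()))
as_dict = dict(returns.items())
print(as_dict["MSFT"])     # 1.05
```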

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Working with Series Data Structure</p>

<a id="3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

**SOME COMMON ATTRIBUTES** [Official Pandas API Document](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)<br>

**Series.index** : Returns the index (axis labels) of the Series.<br>
**Series.values** : Returns the Series as an ndarray or ndarray-like object, depending on the dtype.<br>
**Series.shape** : Returns a tuple with the shape of the data.<br>
**Series.dtype** : Returns the data type of the data.<br>
**Series.size** : Returns the number of elements in the data.<br>
**Series.empty** : Returns True if the Series is empty, otherwise False.<br>
**Series.hasnans** : Returns True if there are any NaN values, otherwise False.<br>
**Series.nbytes** : Returns the number of bytes in the data.<br>
**Series.ndim** : Returns the number of dimensions of the data (always 1 for a Series).<br>
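
Before applying these attributes to a real dataset, here is a minimal sketch on a three-element Series with one missing value. The Series `sample` is made up for illustration.

```python
import numpy as np
import pandas as pd

sample = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])

print(sample.index.tolist())  # ['a', 'b', 'c']
print(sample.shape)           # (3,)
print(sample.size)            # 3
print(sample.ndim)            # 1
print(sample.dtype)           # float64
print(sample.empty)           # False
print(sample.hasnans)         # True
print(sample.nbytes)          # 24 (three 8-byte floats)
```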

In [93]:
games = pd.read_csv("vgsalesGlobale.csv")
games

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.00,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.00,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.00,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.00,Sports,Nintendo,15.75,11.01,3.28,2.96,33.00
4,5,Pokemon Red/Pokemon Blue,GB,1996.00,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37
...,...,...,...,...,...,...,...,...,...,...,...
16593,16596,Woody Woodpecker in Crazy Castle 5,GBA,2002.00,Platform,Kemco,0.01,0.00,0.00,0.00,0.01
16594,16597,Men in Black II: Alien Escape,GC,2003.00,Shooter,Infogrames,0.01,0.00,0.00,0.00,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.00,Racing,Activision,0.00,0.00,0.00,0.00,0.01
16596,16599,Know How 2,DS,2010.00,Puzzle,7G//AMES,0.00,0.01,0.00,0.00,0.01


[**head(n=5)**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

For negative values of n, this function returns all rows except the last |n| rows, equivalent to df[:n].

In [94]:
games.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


**[tail()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html)** returns the last n rows (5 by default) from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows.

In [95]:
games.tail()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
16593,16596,Woody Woodpecker in Crazy Castle 5,GBA,2002.0,Platform,Kemco,0.01,0.0,0.0,0.0,0.01
16594,16597,Men in Black II: Alien Escape,GC,2003.0,Shooter,Infogrames,0.01,0.0,0.0,0.0,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,Racing,Activision,0.0,0.0,0.0,0.0,0.01
16596,16599,Know How 2,DS,2010.0,Puzzle,7G//AMES,0.0,0.01,0.0,0.0,0.01
16597,16600,Spirits & Spells,GBA,2003.0,Platform,Wanadoo,0.01,0.0,0.0,0.0,0.01
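
The behaviour of `head()` and `tail()` with explicit and negative values of `n` can be checked on a small frame. The ten-row DataFrame `small` below is an assumption made for illustration, not part of the games data.

```python
import pandas as pd

small = pd.DataFrame({"n": range(10)})

# First 3 rows.
print(small.head(3)["n"].tolist())    # [0, 1, 2]

# Negative n: everything except the last 3 rows, same as small[:-3].
print(small.head(-3)["n"].tolist())   # [0, 1, 2, 3, 4, 5, 6]

# Last 2 rows.
print(small.tail(2)["n"].tolist())    # [8, 9]

# Negative n: everything except the first 2 rows, same as small[2:].
print(small.tail(-2)["n"].tolist())   # [2, 3, 4, 5, 6, 7, 8, 9]
```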


[**dtypes**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html) attribute returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns. Columns with **mixed data types** are stored with the **``object dtype``**. See the [User Guide](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes) for more.


In [96]:
df = pd.DataFrame({'float': [1.0, 2.2],
                   'int': [1, 2],
                   'datetime': pd.date_range('12/1/2018', periods=2, freq='D'),
                   'string': ['foo', 2]})

print(df)
print("*"*50)
print(df.dtypes)

                 float  int   datetime string
0                 1.00    1 2018-12-01    foo
1                 2.20    2 2018-12-02      2
**************************************************
float              float64
int                  int64
datetime    datetime64[ns]
string              object
dtype: object


In [97]:
games.dtypes

Rank              int64
Name             object
Platform         object
Year            float64
Genre            object
Publisher        object
NA_Sales        float64
EU_Sales        float64
JP_Sales        float64
Other_Sales     float64
Global_Sales    float64
dtype: object

In [98]:
games.Genre.dtypes

dtype('O')

In [99]:
games.Genre.describe()

count      16598
unique        12
top       Action
freq        3316
Name: Genre, dtype: object

[**value_counts()**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html) returns a Series containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

In [100]:
games.Genre.value_counts() 

Action          3316
Sports          2346
Misc            1739
Role-Playing    1488
Shooter         1310
Adventure       1286
Racing          1249
Platform         886
Simulation       867
Fighting         848
Strategy         681
Puzzle           582
Name: Genre, dtype: int64

With normalize set to True, returns the relative frequency by dividing all values by the sum of values.

In [101]:
games.Genre.value_counts(normalize=True) 

Action                         0.20
Sports                         0.14
Misc                           0.10
Role-Playing                   0.09
Shooter                        0.08
Adventure                      0.08
Racing                         0.08
Platform                       0.05
Simulation                     0.05
Fighting                       0.05
Strategy                       0.04
Puzzle                         0.04
Name: Genre, dtype: float64

About 20% of the games in this dataset are action games. Note that this is a share of titles, not of units sold.

The **``bins``** parameter of value_counts() is probably the most underutilized one. It lets value_counts() bin continuous data into discrete intervals. This option works only with numerical data, and it is similar to the **``pd.cut()``** function. 

In [102]:
games.EU_Sales.value_counts() 

 0.00    5730
 0.01    1496
 0.02    1269
 0.03     934
 0.04     748
         ... 
 3.42       1
 2.38       1
 1.99       1
 2.10       1
29.02       1
Name: EU_Sales, Length: 305, dtype: int64

Using value_counts() in a plain way sometimes doesn’t convey much information, because the output contains a separate category for every distinct value of the feature (305 of them here). Instead, let’s group them into 4 bins.

In [103]:
games.EU_Sales.value_counts(bins=4) 

(-0.030000000000000002, 7.255]    16586
(7.255, 14.51]                       11
(21.765, 29.02]                       1
(14.51, 21.765]                       0
Name: EU_Sales, dtype: int64

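Binning makes the distribution easier to read. We can easily see that almost all games (16,586 of 16,598) have EU sales of at most 7.255, while only 12 games sold more than that in the EU. Note that EU_Sales records sales volume, not the price paid by gamers. We can also see that the third bin, (14.51, 21.765], is empty, and that the last bin contains a single game.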
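
To make the link with `pd.cut()` concrete, here is a small sketch on made-up sales figures (the Series `sales` and the bin edges are assumptions for illustration). `value_counts(bins=4)` creates four equal-width bins over the data range, while `pd.cut()` lets us choose the edges ourselves.

```python
import pandas as pd

sales = pd.Series([0.0, 0.1, 0.2, 0.5, 3.0, 29.0])

# Four equal-width bins over the range of the data, listed in bin order.
binned = sales.value_counts(bins=4).sort_index()
print(binned)

# pd.cut assigns each value to an interval with edges we pick ourselves.
labels = pd.cut(sales, bins=[0, 1, 10, 30], include_lowest=True)
counts = labels.value_counts().sort_index()
print(counts)
```

With equal-width bins the single large value stretches the range, so most values land in the first bin; custom edges can spread skewed data more informatively.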

In [104]:
type(games.Genre.value_counts())

pandas.core.series.Series

In [105]:
games.Genre.unique()

array(['Sports', 'Platform', 'Racing', 'Role-Playing', 'Puzzle', 'Misc',
       'Shooter', 'Simulation', 'Action', 'Fighting', 'Adventure',
       'Strategy'], dtype=object)

In [106]:
games.Genre.nunique()

12

In [107]:
len(games.Genre.value_counts()) 

12

In [108]:
games.Genre.ndim

1

In [109]:
games.Genre.size

16598

We can also obtain the dimension and size together through another attribute, **``shape``**, which returns a tuple with the shape of the underlying data.

In [110]:
games.Genre.shape

(16598,)

In [111]:
games.Genre.index

RangeIndex(start=0, stop=16598, step=1)

In [112]:
games.Global_Sales.describe()

count              16,598.00
mean                    0.54
std                     1.56
min                     0.01
25%                     0.06
50%                     0.17
75%                     0.47
max                    82.74
Name: Global_Sales, dtype: float64

In [113]:
games.Global_Sales.mean()

0.53744065550074

In [114]:
games.Global_Sales.median()

0.17

In [115]:
games.Global_Sales.quantile(q=[0.25, 0.5, 0.75], interpolation='linear')

0.25                   0.06
0.50                   0.17
0.75                   0.47
Name: Global_Sales, dtype: float64

In [116]:
games.Global_Sales.index

RangeIndex(start=0, stop=16598, step=1)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Creating a Pandas DataFrame</p>

<a id="4"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is generally the most commonly used pandas object. Pandas DataFrame can be created in multiple ways:

  - Creating a DataFrame from a list of lists.
  - Creating a DataFrame from a dict of ndarrays/lists.
  - Creating a DataFrame from a list of dicts.
  - Creating a DataFrame using the zip() function.
  - Creating a DataFrame from a dict of Series.
  
[Source](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), [Source](https://www.javatpoint.com/how-to-create-a-dataframes-in-python), [Source](https://towardsdatascience.com/15-ways-to-create-a-pandas-dataframe-754ecc082c17)
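
The five approaches listed above can be sketched side by side. The example below uses two made-up rows and builds the same table each way; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

names = ["Bill", "Tom"]
scores = [90, 80]

# 1) List of lists: each inner list is a row; column names are given separately.
from_lists = pd.DataFrame([["Bill", 90], ["Tom", 80]], columns=["name", "score"])

# 2) Dict of ndarrays/lists: each key is a column.
from_dict = pd.DataFrame({"name": names, "score": np.array(scores)})

# 3) List of dicts: each dict is a row.
from_records = pd.DataFrame([{"name": "Bill", "score": 90},
                             {"name": "Tom", "score": 80}])

# 4) zip(): pair up parallel lists into row tuples.
from_zip = pd.DataFrame(list(zip(names, scores)), columns=["name", "score"])

# 5) Dict of Series: each Series becomes a column, aligned on its index.
from_series = pd.DataFrame({"name": pd.Series(names), "score": pd.Series(scores)})

print(from_lists)
```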

**Let's remember how to create a DataFrame in Pandas:**

In [117]:
data = {"name":["Bill", "Tom", "Tim", "John", "Alex", "Vanessa", "Kate"],
        "score":[90, 80, 85, 75, 95, 60, 65],
        "sport":["Wrestling", "Football", "Skiing", "Swimming", "Tennis", "Karete", "Surfing"],
        "sex":["M", "M", "M", "M", "F", "F", "F"]}
data

{'name': ['Bill', 'Tom', 'Tim', 'John', 'Alex', 'Vanessa', 'Kate'],
 'score': [90, 80, 85, 75, 95, 60, 65],
 'sport': ['Wrestling',
  'Football',
  'Skiing',
  'Swimming',
  'Tennis',
  'Karete',
  'Surfing'],
 'sex': ['M', 'M', 'M', 'M', 'F', 'F', 'F']}

As seen, we have created a dictionary and assigned it to a variable named "data".

**pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)** [Official Pandas API](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

In [118]:
df = pd.DataFrame(data)
df

Unnamed: 0,name,score,sport,sex
0,Bill,90,Wrestling,M
1,Tom,80,Football,M
2,Tim,85,Skiing,M
3,John,75,Swimming,M
4,Alex,95,Tennis,F
5,Vanessa,60,Karete,F
6,Kate,65,Surfing,F


## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Working with DataFrames</p>

<a id="5"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

While a `Series` is a single column of data, a `DataFrame` is several columns, one for each variable.

In essence, a `DataFrame` in pandas is analogous to a (highly optimized) Excel spreadsheet.

The two main data structures in pandas both have at least one axis. A **Series** has **one axis**, the index. A **DataFrame** has **two axes**, the index and the columns. It’s useful to note here that in all the DataFrame functions that can be applied to either rows or columns, an axis of 0 refers to the index, an axis of 1 refers to the columns.

Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns, often with  descriptive indexes for individual rows and individual columns.
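
The meaning of `axis=0` and `axis=1` is easiest to see on a tiny example. The frame below is made up for illustration: reducing along axis 0 collapses the index and gives one result per column, while reducing along axis 1 gives one result per row.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: walk down the index, producing one value per column.
col_sums = frame.sum(axis=0)
print(col_sums.tolist())   # [6, 60]

# axis=1: walk across the columns, producing one value per row.
row_sums = frame.sum(axis=1)
print(row_sums.tolist())   # [11, 22, 33]

# The same convention applies to drop(): axis=1 drops a column,
# axis=0 drops a row by its index label.
without_b = frame.drop("b", axis=1)
without_first = frame.drop(0, axis=0)
```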

Let’s look at an example that reads data from the CSV file named `test_lab.csv`. 

In [119]:
df = pd.read_csv('test_lab.csv')
print(f"\033[1mThe type of test_lab.csv is\033[0m {type(df)}")
df

The type of test_lab.csv is <class 'pandas.core.frame.DataFrame'>


Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,Argentina,ARG,2000,37335.65,1.0,295072.22,75.72,5.58
1,Australia,AUS,2000,19053.19,1.72,541804.65,67.76,6.72
2,India,IND,2000,1006300.3,44.94,1728144.37,64.58,14.07
3,Israel,ISR,2000,6114.57,4.08,129253.89,64.44,10.27
4,Malawi,MWI,2000,11801.5,59.54,5026.22,74.71,11.66
5,South Africa,ZAF,2000,45064.1,6.94,227242.37,72.72,5.73
6,United States,USA,2000,282171.96,1.0,9898700.0,72.35,6.03
7,Uruguay,URY,2000,3219.79,12.1,25255.96,78.98,5.11


[**info()**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) prints a concise summary of a DataFrame. This method prints information about a DataFrame including the index range, the number of columns, the column labels, the column data types, the number of non-null values in each column, and the memory usage.

In [120]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   country          8 non-null      object 
 1   country isocode  8 non-null      object 
 2   year             8 non-null      int64  
 3   POP              8 non-null      float64
 4   XRAT             8 non-null      float64
 5   tcgdp            8 non-null      float64
 6   cc               8 non-null      float64
 7   cg               8 non-null      float64
dtypes: float64(5), int64(1), object(2)
memory usage: 640.0+ bytes


In [121]:
df.shape

(8, 8)

In [122]:
df.size

64

In [123]:
df.ndim

2

In [124]:
df.columns

Index(['country', 'country isocode', 'year', 'POP', 'XRAT', 'tcgdp', 'cc',
       'cg'],
      dtype='object')

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Indexing, Slicing & Selection</p>

<a id="6"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

As stated and shown in the examples above, a Series is very similar to a NumPy array. [**What differentiates the NumPy array from a Series**](https://www.educba.com/pandas-vs-numpy/) is that **a Series can have axis labels**, meaning it can be indexed by a label instead of just a numeric position. In other words, the essential difference is **the presence of the index**: while the **``Numpy Array``** has an implicitly defined integer index used to access the values, the **``Pandas Series``** has an explicitly defined index associated with the values (labels). Moreover, it does NOT need to hold numeric data; it can hold any arbitrary Python object [Source](https://rpubs.com/pjozefek/659184).

So, the key to using a Series is understanding its index. Pandas uses these index names or numbers to allow fast lookup of information.

The axis labeling information in pandas objects serves many purposes:

- Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
- Enables automatic and explicit data alignment.
- Allows intuitive getting and setting of subsets of the data set.

In this part of our session, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame as they have received more development attention in this area. For more information, please visit [pandas-docs.github.io](https://pandas-docs.github.io/pandas-docs-travis/user_guide/indexing.html)

The most robust and consistent way of slicing ranges along arbitrary axes is described in the [Selection by Position](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-integer) section detailing the .iloc method. First, let us look at the semantics of slicing using the **[ ]** operator.

[Some More Examples](https://sparkbyexamples.com/pandas/how-to-slice-columns-in-pandas-dataframe/#:~:text=By%20using%20pandas.,columns%2C%20the%20syntax%20is%20df.)

In [125]:
df["country"]

0        Argentina
1        Australia
2            India
3           Israel
4           Malawi
5     South Africa
6    United States
7          Uruguay
Name: country, dtype: object

In [126]:
df[['country', 'POP']]

Unnamed: 0,country,POP
0,Argentina,37335.65
1,Australia,19053.19
2,India,1006300.3
3,Israel,6114.57
4,Malawi,11801.5
5,South Africa,45064.1
6,United States,282171.96
7,Uruguay,3219.79


In [127]:
df[:][['country', 'POP']]

Unnamed: 0,country,POP
0,Argentina,37335.65
1,Australia,19053.19
2,India,1006300.3
3,Israel,6114.57
4,Malawi,11801.5
5,South Africa,45064.1
6,United States,282171.96
7,Uruguay,3219.79


In [128]:
df[2:4][['country', 'POP']]

Unnamed: 0,country,POP
2,India,1006300.3
3,Israel,6114.57
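
The cells above rely on two different behaviours of the **[ ]** operator: a column name (or list of names) selects columns, while a slice selects rows by position. The sketch below shows this on a small stand-in frame; `df_demo` is a hypothetical name holding only the first four rows and five columns of our df.

```python
import pandas as pd

# Hypothetical stand-in for the session's df (first four rows only)
df_demo = pd.DataFrame({
    "country": ["Argentina", "Australia", "India", "Israel"],
    "country isocode": ["ARG", "AUS", "IND", "ISR"],
    "year": [2000, 2000, 2000, 2000],
    "POP": [37335.65, 19053.19, 1006300.3, 6114.57],
    "XRAT": [1.0, 1.72, 44.94, 4.08],
})

col = df_demo["country"]            # a single name returns a Series
cols = df_demo[["country", "POP"]]  # a list of names returns a DataFrame
rows = df_demo[1:3]                 # a slice selects rows at positions 1 and 2

print(type(col).__name__, type(cols).__name__)
print(rows)
```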


### [.loc[ ]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) → allows us to select data using **labels** (names) of rows (index) & columns

### [.iloc[ ]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) → allows us to select data using **integer positions** of rows (index) & columns. It follows classic zero-based indexing logic

Let us first remember our df: 

In [129]:
df

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,Argentina,ARG,2000,37335.65,1.0,295072.22,75.72,5.58
1,Australia,AUS,2000,19053.19,1.72,541804.65,67.76,6.72
2,India,IND,2000,1006300.3,44.94,1728144.37,64.58,14.07
3,Israel,ISR,2000,6114.57,4.08,129253.89,64.44,10.27
4,Malawi,MWI,2000,11801.5,59.54,5026.22,74.71,11.66
5,South Africa,ZAF,2000,45064.1,6.94,227242.37,72.72,5.73
6,United States,USA,2000,282171.96,1.0,9898700.0,72.35,6.03
7,Uruguay,URY,2000,3219.79,12.1,25255.96,78.98,5.11


In [130]:
df.iloc[2:5]

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
2,India,IND,2000,1006300.3,44.94,1728144.37,64.58,14.07
3,Israel,ISR,2000,6114.57,4.08,129253.89,64.44,10.27
4,Malawi,MWI,2000,11801.5,59.54,5026.22,74.71,11.66


In [131]:
df.loc[2:5]

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
2,India,IND,2000,1006300.3,44.94,1728144.37,64.58,14.07
3,Israel,ISR,2000,6114.57,4.08,129253.89,64.44,10.27
4,Malawi,MWI,2000,11801.5,59.54,5026.22,74.71,11.66
5,South Africa,ZAF,2000,45064.1,6.94,227242.37,72.72,5.73


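### **``QUESTION:``** **What happened? Why was South Africa included when** **``loc``** **was used?**

The first thing to remember is that, when slicing, **``.iloc[ ]``** is **exclusive** of the stop position while **``.loc[ ]``** is **inclusive** of the stop label. Our df has the default integer index, so `2:5` means positions 2, 3 and 4 for **``.iloc[ ]``** but labels 2 through 5 for **``.loc[ ]``**.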
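
The difference is easier to see when the index is not made of integers. In the sketch below, `df_demo` is a hypothetical stand-in for our df, with the ISO codes used as row labels (an assumption made only for this illustration).

```python
import pandas as pd

# Hypothetical stand-in frame, indexed by ISO code instead of 0..n-1
df_demo = pd.DataFrame(
    {
        "country": ["Argentina", "Australia", "India", "Israel"],
        "POP": [37335.65, 19053.19, 1006300.3, 6114.57],
    },
    index=["ARG", "AUS", "IND", "ISR"],
)

by_pos = df_demo.iloc[1:3]             # positions 1 and 2; position 3 is excluded
by_label = df_demo.loc["AUS":"ISR"]    # labels AUS through ISR; ISR is included

print(by_pos.index.tolist())
print(by_label.index.tolist())
```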

## 1) **``Using Pandas.DataFrame.loc[]``** (By label)


**1.1 – Slicing Columns by Names or Labels**

By using **``pandas.DataFrame.loc[ ]``** you can slice columns by names or labels. To slice the columns, the syntax is **``df.loc[:, start:stop:step]``**, where start is the name of the first column to take, stop is the name of the last column to take (it is included in the result), and step is the number of columns to advance after each extraction.

In [132]:
df.loc[:, "country": "POP"]

Unnamed: 0,country,country isocode,year,POP
0,Argentina,ARG,2000,37335.65
1,Australia,AUS,2000,19053.19
2,India,IND,2000,1006300.3
3,Israel,ISR,2000,6114.57
4,Malawi,MWI,2000,11801.5
5,South Africa,ZAF,2000,45064.1
6,United States,USA,2000,282171.96
7,Uruguay,URY,2000,3219.79


In [133]:
df.loc[2:6, "country": "POP"]

Unnamed: 0,country,country isocode,year,POP
2,India,IND,2000,1006300.3
3,Israel,ISR,2000,6114.57
4,Malawi,MWI,2000,11801.5
5,South Africa,ZAF,2000,45064.1
6,United States,USA,2000,282171.96
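
The two cells above use only `start:stop`. As a minimal sketch of the `step` part, the example below takes every second column between two labels; `df_demo` is a hypothetical stand-in with the first four rows and five columns of our df.

```python
import pandas as pd

# Hypothetical stand-in for the session's df
df_demo = pd.DataFrame({
    "country": ["Argentina", "Australia", "India", "Israel"],
    "country isocode": ["ARG", "AUS", "IND", "ISR"],
    "year": [2000, 2000, 2000, 2000],
    "POP": [37335.65, 19053.19, 1006300.3, 6114.57],
    "XRAT": [1.0, 1.72, 44.94, 4.08],
})

# Rows 1 through 2 (inclusive), every second column from "country" to "XRAT"
step_cols = df_demo.loc[1:2, "country":"XRAT":2]
print(step_cols)
```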


**1.2 – Slicing DataFrame Columns by Labels**

To slice DataFrame columns by labels or names, all you need to do is provide the labels you want as a list. Here we use a list of labels instead of the start:stop:step approach.

In [61]:
df.loc[:, ['country', 'country isocode', 'year', 'POP']]

Unnamed: 0,country,country isocode,year,POP
0,Argentina,ARG,2000,37335.65
1,Australia,AUS,2000,19053.19
2,India,IND,2000,1006300.3
3,Israel,ISR,2000,6114.57
4,Malawi,MWI,2000,11801.5
5,South Africa,ZAF,2000,45064.1
6,United States,USA,2000,282171.96
7,Uruguay,URY,2000,3219.79


In [134]:
df.loc[:, ('country', 'country isocode', 'year', 'POP')]

Unnamed: 0,country,country isocode,year,POP
0,Argentina,ARG,2000,37335.65
1,Australia,AUS,2000,19053.19
2,India,IND,2000,1006300.3
3,Israel,ISR,2000,6114.57
4,Malawi,MWI,2000,11801.5
5,South Africa,ZAF,2000,45064.1
6,United States,USA,2000,282171.96
7,Uruguay,URY,2000,3219.79


**1.3 – Slicing DataFrame Columns by Range**

When you want to slice a DataFrame by a range of columns, provide the start and stop column names.

  - If you omit the start column, loc[] selects from the first column.
  - If you omit the stop column, loc[] selects every column from the start label to the end.
  - If you provide both, loc[] selects the start column, the stop column, and every column in between.

In [62]:
# Slicing all columns from 'country' to 'POP' (both included)

df.loc[:, 'country':'POP']

Unnamed: 0,country,country isocode,year,POP
0,Argentina,ARG,2000,37335.65
1,Australia,AUS,2000,19053.19
2,India,IND,2000,1006300.3
3,Israel,ISR,2000,6114.57
4,Malawi,MWI,2000,11801.5
5,South Africa,ZAF,2000,45064.1
6,United States,USA,2000,282171.96
7,Uruguay,URY,2000,3219.79


In [63]:
# Slicing by start from 'country isocode' column

df.loc[:, 'country isocode':]

Unnamed: 0,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,ARG,2000,37335.65,1.0,295072.22,75.72,5.58
1,AUS,2000,19053.19,1.72,541804.65,67.76,6.72
2,IND,2000,1006300.3,44.94,1728144.37,64.58,14.07
3,ISR,2000,6114.57,4.08,129253.89,64.44,10.27
4,MWI,2000,11801.5,59.54,5026.22,74.71,11.66
5,ZAF,2000,45064.1,6.94,227242.37,72.72,5.73
6,USA,2000,282171.96,1.0,9898700.0,72.35,6.03
7,URY,2000,3219.79,12.1,25255.96,78.98,5.11


In [64]:
# Slicing from the beginning up to and including the 'XRAT' column

df.loc[:, :'XRAT']

Unnamed: 0,country,country isocode,year,POP,XRAT
0,Argentina,ARG,2000,37335.65,1.0
1,Australia,AUS,2000,19053.19,1.72
2,India,IND,2000,1006300.3,44.94
3,Israel,ISR,2000,6114.57,4.08
4,Malawi,MWI,2000,11801.5,59.54
5,South Africa,ZAF,2000,45064.1,6.94
6,United States,USA,2000,282171.96,1.0
7,Uruguay,URY,2000,3219.79,12.1


**1.4 – Slicing Certain Selective Columns in pandas**

Sometimes you may want to select a few specific, non-adjacent columns from a pandas DataFrame. You can do this by passing the selected column names/labels as a list.

In [65]:
df.loc[:, ['country', 'year', 'POP']]

Unnamed: 0,country,year,POP
0,Argentina,2000,37335.65
1,Australia,2000,19053.19
2,India,2000,1006300.3
3,Israel,2000,6114.57
4,Malawi,2000,11801.5
5,South Africa,2000,45064.1
6,United States,2000,282171.96
7,Uruguay,2000,3219.79


**1.5 – Selecting Every Alternate Column**

Using **``loc[ ]``**, you can also slice columns by selecting every other column from pandas DataFrame.

In [66]:
df.loc[:, ::2]

Unnamed: 0,country,year,XRAT,cc
0,Argentina,2000,1.0,75.72
1,Australia,2000,1.72,67.76
2,India,2000,44.94,64.58
3,Israel,2000,4.08,64.44
4,Malawi,2000,59.54,74.71
5,South Africa,2000,6.94,72.72
6,United States,2000,1.0,72.35
7,Uruguay,2000,12.1,78.98


## 2) **``Using Pandas.DataFrame.iloc[]``** (By position)

By using **``pandas.DataFrame.iloc[ ]``** you can slice a DataFrame by column **position/index**. Always remember that positions start from 0. You can use **``pandas.DataFrame.iloc[ ]``** with the syntax **``[:, start:stop:step]``**, where **start** indicates the position of the first column to take, **stop** indicates the position at which to stop (this column is NOT included), and **step** indicates the number of positions to advance after each extraction. Alternatively, use the syntax **``[:, [indices]]``** with indices as a list of column positions to take.

**2.1 – Slicing Columns by Index Position**

In [67]:
# Let us first remember our df

df

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,Argentina,ARG,2000,37335.65,1.0,295072.22,75.72,5.58
1,Australia,AUS,2000,19053.19,1.72,541804.65,67.76,6.72
2,India,IND,2000,1006300.3,44.94,1728144.37,64.58,14.07
3,Israel,ISR,2000,6114.57,4.08,129253.89,64.44,10.27
4,Malawi,MWI,2000,11801.5,59.54,5026.22,74.71,11.66
5,South Africa,ZAF,2000,45064.1,6.94,227242.37,72.72,5.73
6,United States,USA,2000,282171.96,1.0,9898700.0,72.35,6.03
7,Uruguay,URY,2000,3219.79,12.1,25255.96,78.98,5.11


We are going to refer to columns by their index positions and retrieve slices of the DataFrame. The example below retrieves the "country isocode", "POP" and "XRAT" columns, which sit at positions 1, 3 and 4.

In [68]:
# Slicing by selected column position

df.iloc[:, [1, 3, 4]]

Unnamed: 0,country isocode,POP,XRAT
0,ARG,37335.65,1.0
1,AUS,19053.19,1.72
2,IND,1006300.3,44.94
3,ISR,6114.57,4.08
4,MWI,11801.5,59.54
5,ZAF,45064.1,6.94
6,USA,282171.96,1.0
7,URY,3219.79,12.1


**2.2 Column Slices by Position Range**

Like slices by column labels, you can also slice a DataFrame by a range of positions.

In [69]:
# Slicing between indexes 1 (inclusive) and 4 (exclusive)

df.iloc[:, 1:4]

Unnamed: 0,country isocode,year,POP
0,ARG,2000,37335.65
1,AUS,2000,19053.19
2,IND,2000,1006300.3
3,ISR,2000,6114.57
4,MWI,2000,11801.5
5,ZAF,2000,45064.1
6,USA,2000,282171.96
7,URY,2000,3219.79


In [70]:
# Slicing from index 1 (inclusive) to the end

df.iloc[:, 1:]

Unnamed: 0,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,ARG,2000,37335.65,1.0,295072.22,75.72,5.58
1,AUS,2000,19053.19,1.72,541804.65,67.76,6.72
2,IND,2000,1006300.3,44.94,1728144.37,64.58,14.07
3,ISR,2000,6114.57,4.08,129253.89,64.44,10.27
4,MWI,2000,11801.5,59.54,5026.22,74.71,11.66
5,ZAF,2000,45064.1,6.94,227242.37,72.72,5.73
6,USA,2000,282171.96,1.0,9898700.0,72.35,6.03
7,URY,2000,3219.79,12.1,25255.96,78.98,5.11


In [71]:
# Slicing from the beginning to the 2nd index (exclusive)

df.iloc[:, :2]

Unnamed: 0,country,country isocode
0,Argentina,ARG
1,Australia,AUS
2,India,IND
3,Israel,ISR
4,Malawi,MWI
5,South Africa,ZAF
6,United States,USA
7,Uruguay,URY
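
The cells in this part use `start` and `stop` only. The sketch below adds the `step` part for positional slices, including a negative step; `df_demo` is a hypothetical stand-in with the first four rows and five columns of our df.

```python
import pandas as pd

# Hypothetical stand-in for the session's df
df_demo = pd.DataFrame({
    "country": ["Argentina", "Australia", "India", "Israel"],
    "country isocode": ["ARG", "AUS", "IND", "ISR"],
    "year": [2000, 2000, 2000, 2000],
    "POP": [37335.65, 19053.19, 1006300.3, 6114.57],
    "XRAT": [1.0, 1.72, 44.94, 4.08],
})

every_other = df_demo.iloc[:, 0:5:2]  # positions 0, 2, 4 (stop 5 is excluded)
reversed_cols = df_demo.iloc[:, ::-1] # all columns in reverse order

print(every_other.columns.tolist())
print(reversed_cols.columns.tolist())
```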


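To get the **last column** as a one-column DataFrame use **``df.iloc[:, -1:]``**, and to get just the **first column** use **``df.iloc[:, :1]``**. Passing a single integer instead of a slice, as in the next cell, returns the column as a Series.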

In [72]:
df.iloc[:, -1]

0                   5.58
1                   6.72
2                  14.07
3                  10.27
4                  11.66
5                   5.73
6                   6.03
7                   5.11
Name: cg, dtype: float64
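
A short sketch of the difference between the integer form and the slice form follows. `df_demo` is a hypothetical stand-in with the first four rows and five columns of our df, so its last column is "XRAT" rather than "cg".

```python
import pandas as pd

# Hypothetical stand-in for the session's df
df_demo = pd.DataFrame({
    "country": ["Argentina", "Australia", "India", "Israel"],
    "country isocode": ["ARG", "AUS", "IND", "ISR"],
    "year": [2000, 2000, 2000, 2000],
    "POP": [37335.65, 19053.19, 1006300.3, 6114.57],
    "XRAT": [1.0, 1.72, 44.94, 4.08],
})

last_as_series = df_demo.iloc[:, -1]   # single integer: returns a Series
last_as_frame = df_demo.iloc[:, -1:]   # slice: returns a one-column DataFrame

print(type(last_as_series).__name__, last_as_series.name)
print(type(last_as_frame).__name__, last_as_frame.shape)
```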

## BONUS

In [73]:
df

Unnamed: 0,country,country isocode,year,POP,XRAT,tcgdp,cc,cg
0,Argentina,ARG,2000,37335.65,1.0,295072.22,75.72,5.58
1,Australia,AUS,2000,19053.19,1.72,541804.65,67.76,6.72
2,India,IND,2000,1006300.3,44.94,1728144.37,64.58,14.07
3,Israel,ISR,2000,6114.57,4.08,129253.89,64.44,10.27
4,Malawi,MWI,2000,11801.5,59.54,5026.22,74.71,11.66
5,South Africa,ZAF,2000,45064.1,6.94,227242.37,72.72,5.73
6,United States,USA,2000,282171.96,1.0,9898700.0,72.35,6.03
7,Uruguay,URY,2000,3219.79,12.1,25255.96,78.98,5.11


In [74]:
df.iat[2, 3]  # By position

1006300.297

In [75]:
df.at[2, "POP"]  # By label

1006300.297

**``iat[]``** and **``at[]``** return only a single value (they work with scalars only), which makes them very fast, while **``iloc[]``** and **``loc[]``** can return multiple rows and columns. [Source](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
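
Both accessors can also be used to set a single value. The sketch below reads and writes one cell on a hypothetical stand-in frame, `df_demo`, which holds the first four rows and five columns of our df; the new value 45.0 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for the session's df
df_demo = pd.DataFrame({
    "country": ["Argentina", "Australia", "India", "Israel"],
    "country isocode": ["ARG", "AUS", "IND", "ISR"],
    "year": [2000, 2000, 2000, 2000],
    "POP": [37335.65, 19053.19, 1006300.3, 6114.57],
    "XRAT": [1.0, 1.72, 44.94, 4.08],
})

value_by_label = df_demo.at[2, "POP"]  # row label 2, column label "POP"
value_by_pos = df_demo.iat[2, 3]       # row position 2, column position 3
print(value_by_label, value_by_pos)

# Setting a single cell by label (45.0 is an arbitrary illustrative value)
df_demo.at[2, "XRAT"] = 45.0
print(df_demo.loc[2, "XRAT"])
```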

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:150%; text-align:center; border-radius:10px 10px;">The End of The Lab-01 Session</p>

<a id="7"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" 
alt="CLRSWY"></p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:100%; text-align:center; border-radius:10px 10px;">WAY TO REINVENT YOURSELF</p>

___