## DATA FRAME

In [1]:
import numpy as np
import pandas as pd

The DataFrame is the fundamental data structure in Pandas. It represents data in a tabular, spreadsheet-like format.
A DataFrame is an ordered collection of columns, each of which can hold a different value type (numeric, string, or boolean). It has both a row index and a column index, and both can be labeled. Every column in a DataFrame is a Series, so we can picture the DataFrame as a collection of Series that share the same index.
In this note we explore DataFrames under the following subsections:

+ Creating a Pandas DataFrame
+ Retrieving Labels and Data
+ Pandas DataFrame Size Attributes
+ Inspecting The DataFrame
+ Accessing and Modifying Data
+ Filtering Data

### Creating a Pandas DataFrame
There are different ways to create a DataFrame, but they all go through the ``pd.DataFrame`` constructor. Its parameters are:

Parameter | Description
:- | :-
data | ndarray (structured or homogeneous), Iterable, dict, or DataFrame. A dict can contain Series, arrays, constants, dataclass or list-like objects.
index | Index or array-like. Index to use for the resulting frame. Defaults to RangeIndex if the input data carries no indexing information and no index is provided.
columns | Index or array-like. Column labels to use for the resulting frame. Defaults to RangeIndex (0, 1, 2, ..., n) if no column labels are provided.
dtype | dtype, default None. Data type to force. Only a single dtype is allowed. If None, the dtype is inferred.
copy | Copy data from inputs.
    
Let's explore some basic ways to create a DataFrame:
1. Python dictionaries
2. List of dictionaries
3. List of lists (nested list)
4. Two-dimensional NumPy arrays
5. Series
6. Files

#### Python Dictionaries
The keys of the dictionary are the DataFrame’s column labels, and the dictionary values are the data values in the corresponding DataFrame columns. The values can be contained in a tuple, list, one-dimensional NumPy array, Pandas Series object, or one of several other data types. You can also provide a single value that will be copied along the entire column.
It’s possible to control the order of the columns with the ``columns`` parameter and the row labels with the ``index`` parameter. The basic case, without either parameter, looks like this:

In [2]:
# Creating DataFrame from Dictionary

d = {"key1":[10,20,30,40,50],
     "key2":[100,200,300,400,500],
     "key3":[1000,2000,3000,4000,5000],
     "key4":["A","B","C","D","E"],
     "Key5":50,
    }
pd.DataFrame(d)

Unnamed: 0,key1,key2,key3,key4,Key5
0,10,100,1000,A,50
1,20,200,2000,B,50
2,30,300,3000,C,50
3,40,400,4000,D,50
4,50,500,5000,E,50
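
The `columns` and `index` parameters mentioned above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the dictionary `d_small`, the chosen column order, and the row labels are made up for the example.

```python
import pandas as pd

# Hypothetical small dictionary, shaped like the one used above.
d_small = {
    "key1": [10, 20, 30],
    "key2": [100, 200, 300],
    "key3": ["A", "B", "C"],
}

# `columns` selects and orders the dictionary keys;
# `index` supplies the row labels instead of the default 0, 1, 2.
frame = pd.DataFrame(d_small, columns=["key3", "key1"], index=["a", "b", "c"])
print(frame)
```

Note that `key2` is left out of the result because it is not listed in `columns`.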


#### List of Dictionaries
Again, the dictionary keys are the column labels, and the dictionary values are the data values in the DataFrame.

In [3]:
lst_dict = [{"key1":100,"key2":"one","key3":0.1,"key4":"a"},
      {"key1":200,"key2":"two","key3":0.2,"key4":"b"},
      {"key1":300,"key2":"three","key3":0.3,"key4":"c"},
      {"key1":400,"key2":"four","key3":0.4,"key4":"d"}]

In [4]:
# Creating a DataFrame from a list of Dictionary
pd.DataFrame(lst_dict)

Unnamed: 0,key1,key2,key3,key4
0,100,one,0.1,a
1,200,two,0.2,b
2,300,three,0.3,c
3,400,four,0.4,d


#### List of List (Nested list)
You can also use a nested list, or a list of lists, as the data values. If you do, then it’s wise to explicitly specify the labels of columns, rows, or both when you create the DataFrame:

In [5]:
lst = [[1,10,100,1000],[2,20,200,2000],[3,30,300,3000]]

In [6]:
# Creating a DataFrame from a Nested list
pd.DataFrame(lst,columns=["ten","twenty","thirty","fourty"])

Unnamed: 0,ten,twenty,thirty,fourty
0,1,10,100,1000
1,2,20,200,2000
2,3,30,300,3000


#### Two-dimensional NumPy arrays
You can pass a two-dimensional NumPy array to the DataFrame constructor the same way you do with a nested list. The array carries no labels of its own, so it is usually worth specifying ``columns`` (and optionally ``index``) explicitly.

In [7]:
arr = np.random.rand(3,5)

In [8]:
# Creating a DataFrame from a Two Dimensional NumPy  Array
pd.DataFrame(arr, index=["row1","row2","row3"], columns=["col1","col2","col3","col4","col5"])

Unnamed: 0,col1,col2,col3,col4,col5
row1,0.479538,0.480969,0.086426,0.905108,0.493736
row2,0.186208,0.51192,0.568087,0.06254,0.826074
row3,0.802014,0.577162,0.42568,0.938018,0.911402
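
For comparison, here is a small sketch of what happens when no labels are passed with a NumPy array. It uses a deterministic array built with `np.arange` (an assumption made for the example, so that the values are reproducible) rather than the random array above.

```python
import numpy as np
import pandas as pd

# Deterministic 2x3 array, used only for illustration.
arr_small = np.arange(6).reshape(2, 3)

# No labels supplied: rows and columns both get default integer labels.
default_frame = pd.DataFrame(arr_small)
print(default_frame.columns.tolist())  # [0, 1, 2]

# The same data with explicit row and column labels.
labelled = pd.DataFrame(
    arr_small,
    index=["row1", "row2"],
    columns=["col1", "col2", "col3"],
)
print(labelled)
```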


#### Series
You can also create a DataFrame from a dictionary of Series:

**NB** If the ``columns`` parameter is explicitly specified, it takes precedence over the column names that would otherwise come from the dictionary keys: the DataFrame columns will be exactly the sequence you pass, in that order. As with a Series, if you pass a column that isn't contained in the data, it will appear with NA values in the result.

In [9]:
age = pd.Series([20,23,24,21,32,35], index=["dan","ib","rak","tam","nik","salv"])
height = pd.Series([100,120,111,132,121,131], index=["dan","ib","rak","tam","nik","salv"])
weight = pd.Series([65,73,38,75,54,65], index=["dan","ib","rak","tam","nik","salv"])
ser_dict = {"age":age,"height":height,"weight":weight}
# Creating a DataFrame from a pandas series 
data = pd.DataFrame(ser_dict, columns=["age","weight","height","class"])
data

Unnamed: 0,age,weight,height,class
dan,20,65,100,
ib,23,73,120,
rak,24,38,111,
tam,21,75,132,
nik,32,54,121,
salv,35,65,131,
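
The Series above all share the same index. As a hedged sketch of what happens when they do not, the example below builds a DataFrame from two Series whose indexes only partly overlap. The names `age_part` and `height_part` and their values are invented for the illustration.

```python
import pandas as pd

# Two hypothetical Series with only one label ("ib") in common.
age_part = pd.Series([20, 23], index=["dan", "ib"])
height_part = pd.Series([120, 111], index=["ib", "rak"])

# The rows are aligned on the union of the labels;
# positions with no data are filled with NaN.
aligned = pd.DataFrame({"age": age_part, "height": height_part})
print(aligned)
```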


#### Files
You can save and load the data and labels from a Pandas DataFrame to and from a number of file types, including CSV, Excel, SQL, JSON, and more. This is a very powerful feature.

You can save a DataFrame to a CSV file with ``.to_csv()``.
To read data from a CSV file, we use the ``pd.read_csv()`` function. The key parameters to note are:

Parameter | Description
:- | :-
delimiter | Alias for ``sep``, the character that separates fields.
header | int, list of int, default 'infer'. Row number(s) to use as the column names, and the start of the data. By default the column names are inferred: if no ``names`` are passed, the behavior is identical to ``header=0`` and the column names are taken from the first line of the file.
names | array-like, optional. List of column names to use. If the file contains a header row, you should explicitly pass ``header=0`` to override the column names. Duplicates in this list are not allowed.
index_col | int, str, sequence of int / str, or False, default ``None``. Column(s) to use as the row labels of the ``DataFrame``, given either as a string name or a column index.
skiprows | list-like, int or callable, optional. Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
skipfooter | int, default 0. Number of lines at the bottom of the file to skip (unsupported with ``engine='c'``).
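
A few of these parameters can be tried without a file on disk by reading from an in-memory text buffer. The sketch below is an illustration under assumptions: the CSV text, its column names, and the semicolon separator are all made up, and `io.StringIO` stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content: one junk line, then a header row, then data.
csv_text = (
    "exported by a hypothetical tool\n"
    "id;name;score\n"
    "1;ann;3.5\n"
    "2;bob;4.0\n"
)

frame = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",         # field separator (``delimiter`` is an alias)
    skiprows=1,      # skip the junk line at the top
    header=0,        # the first remaining line holds the column names
    index_col="id",  # use the "id" column as row labels
)

# Passing ``names`` together with ``header=0`` replaces the header names.
renamed = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    skiprows=1,
    header=0,
    names=["key", "person", "points"],
)

# Round trip: write the frame as CSV text and read it back.
buffer = io.StringIO()
frame.to_csv(buffer)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), index_col="id")
print(round_trip)
```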

In [10]:
# Reading a csv file to a Pandas DataFrame 
df = pd.read_csv("pokemon_data.csv")

In [11]:
df.head(5)

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False


### Retrieving Labels and Data
You can get the DataFrame’s row labels with **.index** and its column labels with **.columns**, and an array of the data with **.values**

In [12]:
# Retrieving The row labels 
df.index

RangeIndex(start=0, stop=800, step=1)

In [13]:
# Retrieving the column labels
df.columns

Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

In [14]:
# Alternatively, we can use the dictionary-style keys() method to access the columns.
df.keys()

Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

In [15]:
# Retrieving the data values in the DataFrame 
df.values

array([[1, 'Bulbasaur', 'Grass', ..., 45, 1, False],
       [2, 'Ivysaur', 'Grass', ..., 60, 1, False],
       [3, 'Venusaur', 'Grass', ..., 80, 1, False],
       ...,
       [720, 'HoopaHoopa Confined', 'Psychic', ..., 70, 6, True],
       [720, 'HoopaHoopa Unbound', 'Psychic', ..., 80, 6, True],
       [721, 'Volcanion', 'Fire', ..., 70, 6, True]], dtype=object)
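
The same three pieces can be pulled out of a small hand-made frame, which makes the result easier to check by eye. This is a sketch with invented data. It also uses `.to_numpy()`, the method the pandas documentation recommends over `.values` for getting the data as an array.

```python
import pandas as pd

# Hypothetical two-row frame with one integer and one float column.
small = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]}, index=["x", "y"])

row_labels = small.index.tolist()       # ['x', 'y']
column_labels = small.columns.tolist()  # ['a', 'b']

# Mixed int/float columns are upcast to a common float dtype in the array.
values = small.to_numpy()
print(values)
```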

### Pandas DataFrame Size Attributes
The attributes ``.ndim``, ``.size``, and ``.shape`` return the number of dimensions, the total number of data values, and the number of data values along each dimension, respectively.
DataFrame instances have two dimensions (rows and columns), so ``.ndim`` returns 2. A Series object, on the other hand, has only a single dimension, so in that case ``.ndim`` would return 1.

The ``.shape`` attribute returns a tuple with the number of rows and the number of columns. Finally, ``.size`` returns an integer equal to the number of values in the DataFrame.

In [16]:
# The size (Number of elements in the DataFrame)
df.size

9600

In [17]:
# The shape(Number of rows  and columns )
df.shape

(800, 12)

In [18]:
# Dimension
df.ndim

2
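
The three attributes are related: `.size` is the product of the entries of `.shape`, which is why the outputs above satisfy 800 × 12 = 9600. A minimal sketch on an invented three-row frame:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
small = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = small.shape
print(small.ndim, small.shape, small.size)  # 2 (3, 2) 6

# A single column is a Series, which has one dimension.
column_ndim = small["a"].ndim
print(column_ndim)  # 1
```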

### Inspecting The DataFrame
Here we briefly explore some methods that help us inspect a DataFrame quickly:

Function  | Description
:- | :-
df.info() | Prints a concise summary of a DataFrame, including the index dtype and columns, non-null values and memory usage.
df.head() | Returns the first `n` rows of the object based on position (`n` defaults to 5). It is useful for quickly testing if your object has the right type of data in it.
df.tail() | Returns the last `n` rows of the object based on position (`n` defaults to 5). It is useful for quickly verifying data, for example, after sorting or appending rows.
df.describe() | Generates descriptive statistics: those that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding ``NaN`` values.

In [19]:
# Print a concise summary of a DataFrame.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      414 non-null    object
 4   HP          800 non-null    int64 
 5   Attack      800 non-null    int64 
 6   Defense     800 non-null    int64 
 7   Sp. Atk     800 non-null    int64 
 8   Sp. Def     800 non-null    int64 
 9   Speed       800 non-null    int64 
 10  Generation  800 non-null    int64 
 11  Legendary   800 non-null    bool  
dtypes: bool(1), int64(8), object(3)
memory usage: 69.7+ KB


In [20]:
# Return the first `n` rows
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False


In [21]:
# Return the last `n` rows
df.tail()

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True
799,721,Volcanion,Fire,Water,80,110,120,130,90,70,6,True


In [22]:
# Generate descriptive statistics.
df.describe()

Unnamed: 0,#,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
count,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0
mean,362.81375,69.25875,79.00125,73.8425,72.82,71.9025,68.2775,3.32375
std,208.343798,25.534669,32.457366,31.183501,32.722294,27.828916,29.060474,1.66129
min,1.0,1.0,5.0,5.0,10.0,20.0,5.0,1.0
25%,184.75,50.0,55.0,50.0,49.75,50.0,45.0,2.0
50%,364.5,65.0,75.0,70.0,65.0,70.0,65.0,3.0
75%,539.25,80.0,100.0,90.0,95.0,90.0,90.0,5.0
max,721.0,255.0,190.0,230.0,194.0,230.0,180.0,6.0
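
The calls above rely on the default of five rows. The sketch below passes `n` explicitly and reads single statistics out of the `describe()` result. The frame `numbers` is invented for the example.

```python
import pandas as pd

# Hypothetical frame: the integers 0..9 and their squares.
numbers = pd.DataFrame({"n": range(10), "square": [i * i for i in range(10)]})

first_three = numbers.head(3)  # rows at positions 0, 1, 2
last_two = numbers.tail(2)     # rows at positions 8, 9

# describe() returns a DataFrame, so single statistics can be looked up by label.
stats = numbers.describe()
print(stats.loc["mean", "n"])  # 4.5
```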


### Accessing and Modifying Data

**columns**: You can access a column as you would access an element of a dictionary, by using its label as a key. If the column label is a valid Python identifier, then you can also use dot notation to access the column.

In [23]:
# Accessing  The name column
df["Name"]

0                  Bulbasaur
1                    Ivysaur
2                   Venusaur
3      VenusaurMega Venusaur
4                 Charmander
               ...          
795                  Diancie
796      DiancieMega Diancie
797      HoopaHoopa Confined
798       HoopaHoopa Unbound
799                Volcanion
Name: Name, Length: 800, dtype: object

In [24]:
# Using fancy indexing to select multiple columns
df[["Name", "Defense"]]

Unnamed: 0,Name,Defense
0,Bulbasaur,49
1,Ivysaur,63
2,Venusaur,83
3,VenusaurMega Venusaur,123
4,Charmander,43
...,...,...
795,Diancie,150
796,DiancieMega Diancie,110
797,HoopaHoopa Confined,60
798,HoopaHoopa Unbound,60


In [25]:
df.Name

0                  Bulbasaur
1                    Ivysaur
2                   Venusaur
3      VenusaurMega Venusaur
4                 Charmander
               ...          
795                  Diancie
796      DiancieMega Diancie
797      HoopaHoopa Confined
798       HoopaHoopa Unbound
799                Volcanion
Name: Name, Length: 800, dtype: object
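
Dot notation only works for labels that are valid Python identifiers. A label such as `Sp. Atk` contains a dot and a space, so it has to be accessed with brackets. The sketch below uses a two-row frame with values copied from the Pokémon table above; the variable name `stats_small` is made up.

```python
import pandas as pd

# Two rows taken from the table above, rebuilt by hand for illustration.
stats_small = pd.DataFrame(
    {"Name": ["Bulbasaur", "Charmander"], "Sp. Atk": [65, 60]}
)

by_key = stats_small["Name"]  # dictionary-style access
by_attr = stats_small.Name    # dot notation, same column

# "Sp. Atk" is not a valid identifier, so only bracket access works;
# writing stats_small.Sp. Atk would be a syntax error.
special_attack = stats_small["Sp. Atk"]
print(special_attack.tolist())  # [65, 60]
```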

**indexers:** `loc`, `iloc`  
Pandas provides special indexer attributes that explicitly expose certain indexing schemes. These are not methods but attributes that expose a particular slicing interface to the data in the DataFrame. Some of them were presented in the Series notes; let's review them. (The older `ix` indexer was deprecated and has been removed from pandas, so it is not covered here.)

**.loc[]** *This is used for the explicit case, where we refer to well-defined index labels*  
**.iloc[]** *This is used for the implicit case, where we refer to integer positions*   

We can use the slice notation [:] to get a range of rows. With `.loc` the end label is included; with `.iloc` the end position is excluded, as in ordinary Python slicing.  
When using the indexers we can specify row and column as follows: `loc[row,column]`  


**NB**
>We can use fancy indexing to select numerous columns   
We can use boolean and masking for conditional selection  

In [26]:
data

Unnamed: 0,age,weight,height,class
dan,20,65,100,
ib,23,73,120,
rak,24,38,111,
tam,21,75,132,
nik,32,54,121,
salv,35,65,131,


In [27]:
# indexing a row by its label 
data.loc["dan"]

age        20
weight     65
height    100
class     NaN
Name: dan, dtype: object

In [42]:
# slicing a range of rows
data.loc["dan":"rak"]

Unnamed: 0,age,weight,height,class,state
dan,20,65,100,SS2,Plateau
ib,23,73,120,SS2,Kano
rak,24,38,111,SS2,Bayelsa


In [44]:
# Using Fancy indexing to select multiple columns
data[["age","height"]]

Unnamed: 0,age,height
dan,20,100
ib,23,120
rak,24,111
tam,21,132
nik,32,121
salv,60,131


In [45]:
# Selecting one column for a range of row labels
data.loc["dan":"rak","weight"]

dan    65
ib     73
rak    38
Name: weight, dtype: int64

In [46]:
# Indexing by implicit integer location
data.iloc[0]

age            20
weight         65
height        100
class         SS2
state     Plateau
Name: dan, dtype: object

In [47]:
# Selecting a range of rows by integer position
data.iloc[0:3]

Unnamed: 0,age,weight,height,class,state
dan,20,65,100,SS2,Plateau
ib,23,73,120,SS2,Kano
rak,24,38,111,SS2,Bayelsa


In [48]:
# Selecting one column, by position, for a range of row positions
data.iloc[0:3,1]

dan    65
ib     73
rak    38
Name: weight, dtype: int64
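
The examples above slice with both indexers. A point worth checking directly is how each one treats the end of a slice: `.loc` includes the end label, while `.iloc` excludes the end position. The following sketch uses a small frame (`people`, a name invented here) with a subset of the values used above.

```python
import pandas as pd

# Hypothetical four-row frame with string row labels.
people = pd.DataFrame(
    {"age": [20, 23, 24, 21], "weight": [65, 73, 38, 75]},
    index=["dan", "ib", "rak", "tam"],
)

by_label = people.loc["dan":"rak"]  # label slice: "rak" is included
by_position = people.iloc[0:2]      # position slice: position 2 is excluded

print(by_label.index.tolist())      # ['dan', 'ib', 'rak']
print(by_position.index.tolist())   # ['dan', 'ib']

# The same cell addressed by label and by position.
same_cell = people.loc["ib", "weight"] == people.iloc[1, 1]
```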

**Modifying DataFrame**  
Columns can be modified by assignment (i.e. you can create a new column by assigning to it as if it already existed). You can also use the accessors to modify parts of a Pandas DataFrame by passing a Python sequence, NumPy array, or single value:

In [49]:
# Creating a new column
data["state"] = ["Plateau","Kano","Bayelsa","Enugu","Lagos","Rivers",]

In [50]:
data

Unnamed: 0,age,weight,height,class,state
dan,20,65,100,SS2,Plateau
ib,23,73,120,SS2,Kano
rak,24,38,111,SS2,Bayelsa
tam,21,75,132,SS2,Enugu
nik,32,54,121,SS2,Lagos
salv,60,65,131,SS2,Rivers


In [51]:
# Modifying DataFrame by assignment
data.iloc[:,3] = "SS2"

In [52]:
data

Unnamed: 0,age,weight,height,class,state
dan,20,65,100,SS2,Plateau
ib,23,73,120,SS2,Kano
rak,24,38,111,SS2,Bayelsa
tam,21,75,132,SS2,Enugu
nik,32,54,121,SS2,Lagos
salv,60,65,131,SS2,Rivers


In [53]:
# Modifying a single element
data.loc["salv","age"] = 60

In [54]:
data

Unnamed: 0,age,weight,height,class,state
dan,20,65,100,SS2,Plateau
ib,23,73,120,SS2,Kano
rak,24,38,111,SS2,Bayelsa
tam,21,75,132,SS2,Enugu
nik,32,54,121,SS2,Lagos
salv,60,65,131,SS2,Rivers
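
The kinds of assignment described in this subsection can be collected in one self-contained sketch. The frame `people` and the values assigned to it are assumptions made for the illustration; the last step shows that a sequence of the wrong length is rejected.

```python
import pandas as pd

# Hypothetical three-row frame.
people = pd.DataFrame(
    {"age": [20, 23, 24], "weight": [65, 73, 38]},
    index=["dan", "ib", "rak"],
)

people["class"] = "SS2"                          # a single value is copied to every row
people["state"] = ["Plateau", "Kano", "Bayelsa"] # a sequence needs one item per row
people.loc["rak", "age"] = 25                    # modify a single element by label
people.loc[people["weight"] > 60, "class"] = "SS3"  # modify the rows selected by a mask

# A sequence whose length does not match the number of rows raises ValueError.
length_error_raised = False
try:
    people["height"] = [100, 120]
except ValueError:
    length_error_raised = True

print(people)
```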


### Filtering Data
Data filtering is another powerful feature of Pandas. It works similarly to indexing with Boolean arrays in NumPy and Series.
If you apply a logical operation to a Series object, you get another Series of the Boolean values True and False, which you can use as a mask on the DataFrame. Please refer to the Boolean selection sections of the NumPy and Series notes.


In [55]:
# Conditional Selection
data[data["weight"] > 60]

Unnamed: 0,age,weight,height,class,state
dan,20,65,100,SS2,Plateau
ib,23,73,120,SS2,Kano
tam,21,75,132,SS2,Enugu
salv,60,65,131,SS2,Rivers


In [56]:
# Conditional Selection
df[df.Name == "Volcanion"]

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
799,721,Volcanion,Fire,Water,80,110,120,130,90,70,6,True


In [57]:
#  Conditional selection and fancy indexing.
df[(df.Speed > 120) & (df.Attack < 100)][["Name","Attack","Defense","Sp. Atk","Sp. Def","Speed"]]

Unnamed: 0,Name,Attack,Defense,Sp. Atk,Sp. Def,Speed
23,PidgeotMega Pidgeot,80,80,135,80,121
71,AlakazamMega Alakazam,50,65,175,95,150
102,GengarMega Gengar,65,80,170,95,130
109,Electrode,50,70,80,80,140
146,Jolteon,65,60,110,95,130
183,Crobat,90,80,70,80,130
300,Swellow,85,60,50,50,125
315,Ninjask,90,45,50,50,160
339,ManectricMega Manectric,75,80,135,80,135
431,DeoxysSpeed Forme,95,90,95,90,180
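
The last cell filters rows and then selects columns in two separate steps. As an alternative sketch, `.loc` can take the Boolean mask and the column list together, and masks can be combined with `&` (and), `|` (or) and `~` (not). The four rows below are copied from the tables above; the variable names are made up for the example.

```python
import pandas as pd

# Four rows rebuilt by hand from the outputs above.
stats_small = pd.DataFrame(
    {
        "Name": ["Electrode", "Jolteon", "Bulbasaur", "Charmander"],
        "Attack": [50, 65, 49, 52],
        "Speed": [140, 130, 45, 65],
    }
)

# Each comparison yields a Boolean Series; the parentheses are required with & and |.
fast_and_light = (stats_small["Speed"] > 120) & (stats_small["Attack"] < 100)

# Rows and columns selected in a single .loc call.
selected = stats_small.loc[fast_and_light, ["Name", "Speed"]]

# "or" with |, membership test with isin(), negation with ~.
slow_or_named = stats_small[
    (stats_small["Speed"] < 50) | stats_small["Name"].isin(["Jolteon"])
]
not_fast = stats_small[~fast_and_light]

print(selected)
```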
