
# INTRO

First of all, we import the `pandas` library, along with `NumPy` for numeric operations and `matplotlib.pyplot` for plotting:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

##Data Structures

##`Series`

A `Series` is a one-dimensional labeled array. It is basically a list of values, and it can store any kind of data. Here we can see Messi's goal count in the last 5 seasons he played in La Liga:

In [2]:
goals = pd.Series([37, 34, 36, 25, 30])
goals

0    37
1    34
2    36
3    25
4    30
dtype: int64

Remember that the left column shows what is known as the `Index` of each element of the `Series`. We can name each index entry; otherwise the entries are numbered **starting from 0** (never forget that):

In [3]:
goals.values #Returns the values

array([37, 34, 36, 25, 30])

In [4]:
goals.index #Returns the index

RangeIndex(start=0, stop=5, step=1)

In [5]:
goals = pd.Series([37, 34, 36, 25, 30], index = ['Season 16-17', 'Season 17-18', 'Season 18-19', 'Season 19-20', 'Season 20-21'])
goals

Season 16-17    37
Season 17-18    34
Season 18-19    36
Season 19-20    25
Season 20-21    30
dtype: int64

In [6]:
goals['Season 18-19'] #To access a value by its index label

36
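
Label access and position access are two different things once the index has names. A minimal sketch, rebuilding the same `Series` so it runs on its own (the variable names `by_label`, `by_position` and `label_slice` are only illustrative):

```python
import pandas as pd

# Same data as above, rebuilt so the sketch is self-contained.
goals = pd.Series(
    [37, 34, 36, 25, 30],
    index=['Season 16-17', 'Season 17-18', 'Season 18-19',
           'Season 19-20', 'Season 20-21'],
)

by_label = goals.loc['Season 18-19']   # lookup by index label
by_position = goals.iloc[2]            # lookup by integer position (0-based)

# Slicing by label includes BOTH endpoints, unlike positional slicing.
label_slice = goals.loc['Season 17-18':'Season 19-20']

print(by_label, by_position)
print(label_slice)
```

Both lookups return the same value here because `'Season 18-19'` happens to sit at position 2.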

We can also do math operations on the data without changing the original values of the `Series`:

In [7]:
goals**2

Season 16-17    1369
Season 17-18    1156
Season 18-19    1296
Season 19-20     625
Season 20-21     900
dtype: int64

In [8]:
goals #The values are still the same as the former ones

Season 16-17    37
Season 17-18    34
Season 18-19    36
Season 19-20    25
Season 20-21    30
dtype: int64

In [9]:
goals = goals - 10 #Now indeed changing them
goals

Season 16-17    27
Season 17-18    24
Season 18-19    26
Season 19-20    15
Season 20-21    20
dtype: int64
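
To make the point explicit: arithmetic and comparisons on a `Series` work element by element and return a new object, so the original only changes when we assign the result back. A small sketch with the same numbers (the names `squared` and `above_30` are just for illustration):

```python
import pandas as pd

goals = pd.Series([37, 34, 36, 25, 30])

squared = goals ** 2    # element-wise power; returns a new Series
above_30 = goals > 30   # comparisons are element-wise too and give booleans

print(goals.tolist())     # the original values are untouched
print(squared.tolist())
print(above_30.tolist())
```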

##Basic Statistical Operations

We can use statistical functions to work with the `Series`:

In [10]:
goals = goals + 10
goals

Season 16-17    37
Season 17-18    34
Season 18-19    36
Season 19-20    25
Season 20-21    30
dtype: int64

In [11]:
goals.mean()

32.4

In [12]:
goals.median()

34.0

In [13]:
goals.std()

4.9295030175464944

In [14]:
goals.describe() #Roughly the equivalent of R's "summary()" function, here applied to a Series

count     5.000000
mean     32.400000
std       4.929503
min      25.000000
25%      30.000000
50%      34.000000
75%      36.000000
max      37.000000
dtype: float64
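
One detail worth keeping in mind: the `std` reported above is the sample standard deviation (it divides by n - 1), while NumPy's `np.std` divides by n by default. The sketch below compares the two and also recomputes the quartile rows of `.describe()` with `.quantile()`; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

goals = pd.Series([37, 34, 36, 25, 30])

sample_std = goals.std()             # pandas default: ddof=1 (divide by n - 1)
population_std = goals.std(ddof=0)   # divide by n instead
numpy_std = np.std(goals.to_numpy()) # NumPy default is also ddof=0

# The 25%, 50% and 75% rows of describe() are plain quantiles.
quartiles = goals.quantile([0.25, 0.5, 0.75])

print(sample_std, population_std, numpy_std)
print(quartiles)
```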

###`DataFrame`

Let's review how we can create one. This time in Python, not in R:

In [41]:
data = pd.DataFrame({'Season' : ['14-15', '15-16', '16-17', '17-18', '18-19'],
                     'Goals' : [39, 30, 20, 28, 23],
                     'Assists' : [8,18, 19, 16, 13]})
data

  Season  Goals  Assists
0  14-15     39        8
1  15-16     30       18
2  16-17     20       19
3  17-18     28       16
4  18-19     23       13


This `DataFrame` shows Neymar's performance (club only) in 5 different seasons. We can get some basic information about the `DataFrame` by using <font color = "orange">**`.info()`**</font>:

In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Season   5 non-null      object
 1   Goals    5 non-null      int64 
 2   Assists  5 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes


From this output we get some important information:

- 5 rows
- 3 columns
- each column's `dtype`
- the `DataFrame`'s memory usage
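
Each of these pieces is also available on its own, which is handy when we need the value in code rather than as printed text. A minimal sketch, rebuilding the same `DataFrame` (the variable names are only illustrative):

```python
import pandas as pd

data = pd.DataFrame({'Season': ['14-15', '15-16', '16-17', '17-18', '18-19'],
                     'Goals': [39, 30, 20, 28, 23],
                     'Assists': [8, 18, 19, 16, 13]})

n_rows, n_cols = data.shape       # (rows, columns)
column_types = data.dtypes        # one dtype per column
non_null = data.notna().sum()     # non-null count per column
memory = data.memory_usage(deep=True).sum()  # bytes, counting string contents too

print(n_rows, n_cols)
print(column_types)
print(non_null)
print(memory)
```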

In [17]:
data.mean(numeric_only=True) #The same ops we used on the Series; numeric_only skips the text column 'Season'

Goals      28.0
Assists    14.8
dtype: float64

In [18]:
data.mean(axis=1, numeric_only=True) #Row-wise mean; not meaningful here, it just shows what axis=1 does

0    23.5
1    24.0
2    19.5
3    22.0
4    18.0
dtype: float64
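
The `axis` argument decides the direction of the calculation: `axis=0` collapses the rows and gives one result per column, `axis=1` collapses the columns and gives one result per row. A small sketch on the numeric columns only; the `contributions` example (goals plus assists per season) is an illustration of a row-wise operation that does make sense, not something from the original notebook:

```python
import pandas as pd

data = pd.DataFrame({'Season': ['14-15', '15-16', '16-17', '17-18', '18-19'],
                     'Goals': [39, 30, 20, 28, 23],
                     'Assists': [8, 18, 19, 16, 13]})

numeric = data[['Goals', 'Assists']]   # keep only the numeric columns

col_means = numeric.mean(axis=0)       # one value per column
row_means = numeric.mean(axis=1)       # one value per row
contributions = numeric.sum(axis=1)    # goals + assists for each season

print(col_means)
print(row_means.tolist())
print(contributions.tolist())
```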


In [23]:
indexed = data.set_index('Season') #'Season' becomes the row index; data itself is left unchanged

In [24]:
indexed

        Goals  Assists
Season                
14-15      39        8
15-16      30       18
16-17      20       19
17-18      28       16
18-19      23       13
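
Note that `set_index` returns a new `DataFrame` and leaves the original alone, and that `reset_index` undoes it. A minimal sketch (the names `indexed`, `goals_1617` and `restored` are only illustrative):

```python
import pandas as pd

data = pd.DataFrame({'Season': ['14-15', '15-16', '16-17', '17-18', '18-19'],
                     'Goals': [39, 30, 20, 28, 23],
                     'Assists': [8, 18, 19, 16, 13]})

indexed = data.set_index('Season')   # new DataFrame; 'Season' is now the row index

# With a meaningful index we can look rows up by season label.
goals_1617 = indexed.loc['16-17', 'Goals']

restored = indexed.reset_index()     # 'Season' goes back to being a regular column

print(list(data.columns))      # the original still has three columns
print(list(indexed.columns))
print(goals_1617)
print(list(restored.columns))
```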


In [25]:
data.describe() #Roughly the equivalent of R's "summary()" function

           Goals    Assists
count   5.000000   5.000000
mean   28.000000  14.800000
std     7.314369   4.438468
min    20.000000   8.000000
25%    23.000000  13.000000
50%    28.000000  16.000000
75%    30.000000  18.000000
max    39.000000  19.000000


<font color = "orange">**`.describe()`**</font> is a handy function that helps us find relevant information, and it accepts parameters to adjust the summary, like this:

In [28]:
data.describe(percentiles = [0.05]) #Ask for the 5th percentile instead of the default quartiles

           Goals    Assists
count   5.000000   5.000000
mean   28.000000  14.800000
std     7.314369   4.438468
min    20.000000   8.000000
5%     20.600000   9.000000
50%    28.000000  16.000000
max    39.000000  19.000000
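
The two parameters do different jobs: `percentiles` changes which percentile rows appear, while `include` changes which columns are summarized. A sketch under the assumption that `Season` is still a regular text column (the names `numeric_summary` and `full_summary` are only illustrative):

```python
import pandas as pd

data = pd.DataFrame({'Season': ['14-15', '15-16', '16-17', '17-18', '18-19'],
                     'Goals': [39, 30, 20, 28, 23],
                     'Assists': [8, 18, 19, 16, 13]})

# Default behaviour: numeric columns only, with the requested percentiles.
numeric_summary = data.describe(percentiles=[0.05, 0.95])

# include='all' also summarizes the text column, adding rows such as
# 'unique', 'top' and 'freq'; numeric statistics are NaN for text columns.
full_summary = data.describe(include='all')

print(numeric_summary)
print(full_summary)
```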


In [29]:
help(data.describe)

Help on method describe in module pandas.core.generic:

describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False) -> ~FrameOrSeries method of pandas.core.frame.DataFrame instance
    Generate descriptive statistics.
    
    Descriptive statistics include those that summarize the central
    tendency, dispersion and shape of a
    dataset's distribution, excluding ``NaN`` values.
    
    Analyzes both numeric and object series, as well
    as ``DataFrame`` column sets of mixed data types. The output
    will vary depending on what is provided. Refer to the notes
    below for more detail.
    
    Parameters
    ----------
    percentiles : list-like of numbers, optional
        The percentiles to include in the output. All should
        fall between 0 and 1. The default is
        ``[.25, .5, .75]``, which returns the 25th, 50th, and
        75th percentiles.
    include : 'all', list-like of dtypes or None (default), optional
        A white list of data typ

OK, so here is an important thing that you already learned in R. When you call `data.describe()`, it returns summary statistics for every column that holds numeric data. So if we have a column containing some kind of ID number (a CPF or a registration number, for example), it will be summarized too, even though its mean or standard deviation means nothing. **So always remember to convert such columns to a categorical type**, just like you did with factors in R. Watch the following:

In [42]:
data.insert(1, "Year", [2015, 2016, 2017, 2018, 2019], True) #Insert a new column at position 1 (modifies data in place)
data

  Season  Year  Goals  Assists
0  14-15  2015     39        8
1  15-16  2016     30       18
2  16-17  2017     20       19
3  17-18  2018     28       16
4  18-19  2019     23       13


In [43]:
data.describe()

              Year      Goals    Assists
count     5.000000   5.000000   5.000000
mean   2017.000000  28.000000  14.800000
std       1.581139   7.314369   4.438468
min    2015.000000  20.000000   8.000000
25%    2016.000000  23.000000  13.000000
50%    2017.000000  28.000000  16.000000
75%    2018.000000  30.000000  18.000000
max    2019.000000  39.000000  19.000000


And now we can see that it computes statistics on *Year* just as it would on any other number. What we did with `as.factor()` in R, here in Python we do with `.astype("category")`:

In [48]:
data["Year"] = data["Year"].astype("category")

In [49]:
data.describe() #Year is now categorical, so it no longer appears in the numeric summary

           Goals    Assists
count   5.000000   5.000000
mean   28.000000  14.800000
std     7.314369   4.438468
min    20.000000   8.000000
25%    23.000000  13.000000
50%    28.000000  16.000000
75%    30.000000  18.000000
max    39.000000  19.000000
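
The effect of the conversion can be checked directly. The sketch below uses a reduced two-column `DataFrame` (an assumption made to keep it short) and compares which columns `.describe()` reports before and after; a categorical column gets its own kind of summary when described on its own.

```python
import pandas as pd

data = pd.DataFrame({'Year': [2015, 2016, 2017, 2018, 2019],
                     'Goals': [39, 30, 20, 28, 23]})

before = data.describe().columns.tolist()   # Year is treated as a number

data['Year'] = data['Year'].astype('category')

after = data.describe().columns.tolist()    # Year is skipped now
year_summary = data['Year'].describe()      # count, unique, top, freq

print(before)
print(after)
print(data.dtypes)
print(year_summary)
```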


In [51]:
data.columns

Index(['Season', 'Year', 'Goals', 'Assists'], dtype='object')

In [56]:
data['Assists'] 

0     8
1    18
2    19
3    16
4    13
Name: Assists, dtype: int64

In [57]:
data[:3] #The first three rows (positions 0, 1 and 2)

  Season  Year  Goals  Assists
0  14-15  2015     39        8
1  15-16  2016     30       18
2  16-17  2017     20       19


We can also sort the rows of the `DataFrame` by the values of a specific column:

In [58]:
data.sort_values(by = 'Goals')

  Season  Year  Goals  Assists
2  16-17  2017     20       19
4  18-19  2019     23       13
3  17-18  2018     28       16
1  15-16  2016     30       18
0  14-15  2015     39        8


##Selecting Data

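This is a very important section. Here we will see different ways to select the data we want to work with, starting with selection by `index`, which picks out rows, columns, or individual values:
 - <font color = "orange">**`.loc[]`**</font>: selects by **label** (the index label of a row, or the column's name);
 - <font color = "orange">**`.iloc[]`**</font>: selects by **integer position**, starting from 0: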

In [59]:
data.loc[1]

Season     15-16
Year        2016
Goals         30
Assists       18
Name: 1, dtype: object

In [61]:
data.loc[[2, 3]] #We could use '16-17' instead of 2 if 'Season' were the index

  Season  Year  Goals  Assists
2  16-17  2017     20       19
3  17-18  2018     28       16


In [62]:
data.iloc[-1] #The last row, selected by position

Season     18-19
Year        2019
Goals         23
Assists       13
Name: 4, dtype: object

In [63]:
data.iloc[0:2, 0]

0    14-15
1    15-16
Name: Season, dtype: object
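
With the default numeric index, `.loc[]` and `.iloc[]` can look interchangeable, so the difference is easier to see once the index has labels. A minimal sketch, assuming we use `Season` as the index (the variable names are only illustrative):

```python
import pandas as pd

data = pd.DataFrame({'Season': ['14-15', '15-16', '16-17', '17-18', '18-19'],
                     'Goals': [39, 30, 20, 28, 23],
                     'Assists': [8, 18, 19, 16, 13]})
indexed = data.set_index('Season')

# Label slice: both endpoints are included, so this gives 3 rows.
loc_rows = indexed.loc['14-15':'16-17']

# Positional slice: the end is excluded, so this gives 2 rows.
iloc_rows = indexed.iloc[0:2]

# The same cell, reached by labels and by positions (row, column).
cell = indexed.loc['15-16', 'Assists']
same_cell = indexed.iloc[1, 1]

print(loc_rows)
print(iloc_rows)
print(cell, same_cell)
```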

###Locating by boolean criteria

We can select only the rows that satisfy a condition:

In [65]:
data[data['Assists'] > 15]

  Season  Year  Goals  Assists
1  15-16  2016     30       18
2  16-17  2017     20       19
3  17-18  2018     28       16


**Operators:** how they compare with Python's logical operators

* <font color = "orange">**`&`**</font>: plays the role of <font color = "orange">**`and`**</font>
* <font color = "orange">**`|`**</font>: plays the role of <font color = "orange">**`or`**</font>
* <font color = "orange">**`~`**</font>: plays the role of <font color = "orange">**`not`**</font>


In [66]:
data[(data['Goals'] > 25) & (data['Assists'] > 15)]

  Season  Year  Goals  Assists
1  15-16  2016     30       18
3  17-18  2018     28       16
