# Pandas information methods

Pandas, and especially its DataFrames, will serve as our container for the data.

Series and DataFrames have many methods that give us information about the data they contain.

**Series and DataFrame: information methods**
- .info
- .describe
- .count, .mean, .median, .max, .std, ...
- .agg
- .corr
- .groupby

**Pandas API Reference**

https://pandas.pydata.org/docs/reference/index.html

In [None]:
import pandas as pd

### Dataset Auto MPG
Source: UC Irvine Machine Learning Repository

https://archive.ics.uci.edu/ml/datasets/auto+mpg

In [None]:
# Alternative: load the mpg dataset from Seaborn, which ships it as a sample dataset.
# Seaborn is a statistical visualization library built on top of matplotlib.
# https://seaborn.pydata.org/
# import seaborn as sns
# df = sns.load_dataset('mpg')

In [None]:
# Load dataset from local datasets repository
df = pd.read_csv('../datasets/auto-mpg.csv')

In [None]:
# Display the dataset; the output also shows the number of rows and columns
df

In [None]:
# shape gives the dimensions of the dataset directly, as (rows, columns)
df.shape

In [None]:
# Displaying a Series (a single column) also shows some information: its length and data type
df.mpg

In [None]:
# show columns and data types
df.dtypes

In [None]:
# Show DataFrame info:
# columns, non-null counts, data types, memory usage...
df.info()
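As a complement to `info()`, the sketch below counts missing values per column on a small made-up DataFrame. The `toy` data is invented for illustration and is not taken from Auto MPG.

```python
import pandas as pd

# Toy data invented for illustration (not the Auto MPG values)
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, None, 32.5],
    "cylinders": [8, 8, 4, 4],
    "origin": ["usa", "usa", "japan", "europe"],
})

print(toy.shape)             # (rows, columns)
print(toy.dtypes)            # one data type per column

non_null = toy.count()       # non-null values per column, as reported by info()
missing = toy.isna().sum()   # missing values per column
print(missing)
```

For every column, the non-null count plus the missing count equals the number of rows.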

In [None]:
# basic statistical data, only for numeric variables
df.describe()

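- count - number of non-null (non-missing) values in the column; zeros are counted
- mean - arithmetic mean of the data
- std (standard deviation) - measure of data dispersion: roughly the typical distance between the data and their mean (pandas computes the sample standard deviation, dividing by n - 1)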
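A minimal sketch of how `count` and `std` behave, using small made-up Series (the values are illustrative only). It assumes the pandas default of `ddof=1` for `std`, which gives the sample standard deviation.

```python
import pandas as pd

# count ignores missing values but does count zeros
with_gap = pd.Series([0, 1, None])
n_valid = with_gap.count()

# Made-up values chosen so that the population standard deviation is exactly 2
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = s.std()            # pandas default, ddof=1 (divide by n - 1)
population_std = s.std(ddof=0)  # divide by n

# The same sample standard deviation computed step by step
squared_dev = (s - s.mean()) ** 2
manual_std = (squared_dev.sum() / (len(s) - 1)) ** 0.5

print(n_valid, sample_std, population_std, manual_std)
```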

The next five values are the FIVE-NUMBER SUMMARY.
- Together they split the sorted data into four parts that each hold about the same number of observations
- min - minimum value
- 25th percentile, first quartile or Q1: 25% of the data lie between the minimum and Q1
- 50th percentile or MEDIAN: 50% of the data lie between the minimum and the median
- 75th percentile, third quartile or Q3: 75% of the data lie between the minimum and Q3
- max - maximum value
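The five values listed above can also be obtained directly with `quantile`. This is a small sketch on a made-up Series, relying on the default linear interpolation that pandas uses for quantiles.

```python
import pandas as pd

# Made-up values for illustration
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# min, Q1, median, Q3, max
five = s.quantile([0, 0.25, 0.5, 0.75, 1])
print(five)

# describe() reports the same values under these labels
from_describe = s.describe()[["min", "25%", "50%", "75%", "max"]]
print(from_describe)
```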

In [None]:
# We can restrict the summary to non-numeric columns only.
# A different set of statistics is calculated: count, unique, top, freq.

df.describe(exclude = 'number')

- count - number of non-null values for the column
- unique: number of different values
- top: most repeated value
- freq: number of occurrences of the most repeated value
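To make these four statistics concrete, here is a sketch on a small invented DataFrame (the `toy` values are not from Auto MPG). `value_counts` is shown alongside because it gives the full frequency table behind `top` and `freq`.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 32.5, 26.0, 14.0],
    "origin": ["usa", "usa", "japan", "europe", "usa"],
})

# Only the non-numeric column is summarized
summary = toy.describe(exclude="number")
print(summary)

# Full frequency table for the column
counts = toy["origin"].value_counts()
print(counts)
```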

### Categorical vs. numerical variables

Simplifying:
- **Numerical** variables: those that can be measured (**quantitative** variables)
- **Categorical** variables: those that cannot be measured (**qualitative** variables)

`describe` decides which statistics to compute from the type of each column (int and float -> numerical, object -> categorical); `dtypes` and `info` list those types.

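One possible way to separate the two kinds of columns by type is `select_dtypes`. The sketch below uses an invented `toy` DataFrame and treats every non-numeric column as categorical, which is a simplifying assumption.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 32.5],
    "cylinders": [8, 8, 4],
    "origin": ["usa", "usa", "japan"],
    "name": ["car a", "car b", "car c"],
})

# Columns with an integer or float type
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical here
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```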
In [None]:
# Summary of statistics applied to a Series/column
# Has the same meaning as in the case of a DataFrame
df.mpg.describe()

In [None]:
# We can apply a function to the whole dataframe to calculate statistics (applied by columns).

In [None]:
# numeric_only=True skips text columns; pandas 2.0 and later no longer drop them silently
df.mean(numeric_only=True)


In [None]:
# numeric_only=True skips text columns here as well
df.median(numeric_only=True)

In [None]:
df.max()
#df.agg('max')

In [None]:
# We can also apply the functions only to a series/column.

In [None]:
df.mpg.mean()
#df.mpg.agg('mean')

In [None]:
# It is possible to apply several functions in a single line (either to the whole dataframe or only to a series/column).

In [None]:
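# Keep only the numeric columns first, since mean and std are not defined for text
df.select_dtypes(include='number').agg(['mean', 'std'])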

In [None]:
df.mpg.agg(['mean', 'std'])
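`agg` also accepts a dictionary that maps each column to its own list of functions. The following is a minimal sketch on invented data. The column names mirror the dataset, but the values are made up.

```python
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    "mpg": [10.0, 20.0, 30.0],
    "weight": [1000, 2000, 3000],
})

# Different functions for each column; cells that were not requested are NaN
result = toy.agg({
    "mpg": ["mean", "max"],
    "weight": ["min", "median"],
})
print(result)
```

The result has one row per function name and one column per aggregated variable.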

### Correlation

In our work as data scientists or analysts, a common objective is
to find relationships between different variables.
But be careful: correlation between two variables does not imply causation!

**Correlation** measures the strength and direction of the relationship between two variables.

Correlation does not explain why there is a relationship between variables; it only
indicates that one exists and gives a measure of its strength.

Statistical mantra no. 1: **CORRELATION DOES NOT IMPLY CAUSATION.**
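To show what the number means, this sketch computes the Pearson correlation coefficient by hand and compares it with `Series.corr`, whose default method is Pearson. The two Series hold made-up values.

```python
import pandas as pd

# Made-up values for illustration
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 1, 4, 3, 5])

# Pearson r: sum of products of deviations divided by
# the square root of the product of the sums of squared deviations
dx = x - x.mean()
dy = y - y.mean()
manual_r = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

pandas_r = x.corr(y)   # default method is 'pearson'

# A perfectly inverse relationship gives -1
inverse_r = x.corr(-x)

print(manual_r, pandas_r, inverse_r)
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 a weak one.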

In [None]:
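# Compute the correlation between every pair of numeric columns with the "corr" method.
# The result is a table that is symmetric about its main diagonal.
# numeric_only=True skips text columns, which pandas 2.0 and later no longer drop silently.
df.corr(numeric_only=True)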

In [None]:
# Some observations:
# - the diagonal is always 1: every variable is perfectly correlated with itself
# - weight and miles per gallon (mpg) have a fairly strong inverse (negative) correlation
# - number of cylinders is highly correlated with horsepower
# - acceleration is only weakly related to model year

### groupby

In [None]:
# The categorical variables can be used to create different groups, on which to apply the same functions that we applied on other variables (mean, max, ...).

In [None]:
# First create the group (a kind of special grouped dataframe).
# Next, apply the function to the new group.

In [None]:
df_groupby_origin = df.groupby('origin')
# numeric_only=True skips text columns such as the car name
df_groupby_origin.mean(numeric_only=True)
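The two steps above (create the group, then apply a function) can be sketched on a small invented DataFrame. The `toy` values and the output column names `mpg_mean` and `cars` are chosen here for illustration only.

```python
import pandas as pd

# Toy data invented for illustration (not the Auto MPG values)
toy = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan", "europe"],
    "mpg": [15.0, 17.0, 30.0, 34.0, 25.0],
    "name": ["car a", "car b", "car c", "car d", "car e"],
})

# Step 1: create the grouped object
grouped = toy.groupby("origin")

# Step 2: apply a function to each group
mean_by_origin = grouped.mean(numeric_only=True)
print(mean_by_origin)

# Several statistics at once, using named aggregation
summary = grouped.agg(
    mpg_mean=("mpg", "mean"),
    cars=("mpg", "size"),
)
print(summary)
```

The group keys become the index of the result, sorted alphabetically by default.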