<a href="https://colab.research.google.com/github/bgamboap/intro-data-science/blob/main/DataTypesMeasuresSimplePlots.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Data Science 2020/2021**
Department of Computer Science, Faculty of Sciences, University of Porto

**Practical #2**

# **Types of vars, measures and simple plots**

Use a manually defined vector to try out some simple measures and play with data types.
Let us first define a vector in pandas. Pandas provides the function `array`, which builds a one-dimensional array object from a Python list.

In [6]:
import pandas as pd
x = pd.array([10,12,1,5,67])
print(x)

<IntegerArray>
[10, 12, 1, 5, 67]
Length: 5, dtype: Int64


## Types
As we can see, pandas created this as an array of **integers**. The `dtype` (short for data type) `Int64` means these are 64-bit integers. The type was inferred automatically, but we can force the type we want with the argument `dtype`. If we create another array with a different kind of content, the `array` function will infer a different type. In this case we will use *strings*: sequences of characters, mostly used to represent words and text.

In [7]:
s = pd.array(['mon','tue','wed','thu','fri'])
print(s)
print(s[1])
print(s[0:3])

<StringArray>
['mon', 'tue', 'wed', 'thu', 'fri']
Length: 5, dtype: string
tue
<StringArray>
['mon', 'tue', 'wed']
Length: 3, dtype: string
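
The `dtype` argument mentioned above can be sketched as follows. This is only a small illustration: the variable names (`values`, `inferred`, `as_float`, `as_int32`) are ours, and the exact name of the array class that gets printed may differ between pandas versions.

```python
import pandas as pd

values = [10, 12, 1, 5, 67]

inferred = pd.array(values)                   # type inferred from the content
as_float = pd.array(values, dtype="float64")  # same values, forced to floats
as_int32 = pd.array(values, dtype="Int32")    # forced to 32-bit integers

# Print only the data type of each array
print(inferred.dtype)
print(as_float.dtype)
print(as_int32.dtype)
```

Forcing a type is useful, for example, when integer codes should later be averaged as floats, or when a smaller integer type is enough to hold the values.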


Let's try with non-integer numbers and with booleans.

In [8]:
r = pd.array([12.3,14.5,22.95,30,16])
print(r)
b = pd.array([False,False,True,False,True])
print(b)
if b[4]:
    print(r[3]*2)
else:
    print(r[3])

<PandasArray>
[12.3, 14.5, 22.95, 30.0, 16.0]
Length: 5, dtype: float64
<BooleanArray>
[False, False, True, False, True]
Length: 5, dtype: boolean
60.0


Note that `False` and `True` were recognized as logical values. If we want the string `"False"` we have to add the quotes.

- What is the `if..else` statement doing? Make changes and see the differences.

There are many other types that you may want to explore, but these are the most important ones.

## Exercises with types

1. Try to force the type of the arrays to non-default values (the floats as strings, the booleans as integers, etc.). Observe the result of these changes. How can they be useful? Think of examples.
2. Read the documentation of the pandas function `array`.
3. There is also an `array` function in the package `numpy`. This is normal: different packages can have functions with the same name. But be cautious, because in general they behave differently. These two slightly different kinds of array can, however, be mixed in the same expression. Study the differences between the two functions, but use the pandas one in these exercises.

In [9]:
import numpy as np
r1 = pd.array([12.3,14.5,22.95,30,16])
r2 = np.array([12.3,14.5,22.95,30,16])
print(r1+r1)
print(r2+r2)
print(r1+r2)

<PandasArray>
[24.6, 29.0, 45.9, 60.0, 32.0]
Length: 5, dtype: float64
[24.6 29.  45.9 60.  32. ]
<PandasArray>
[24.6, 29.0, 45.9, 60.0, 32.0]
Length: 5, dtype: float64
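
One difference between the two kinds of array that is easy to observe is the treatment of missing values. The sketch below is an illustration only (the names `data`, `p` and `n` are ours): it builds both arrays from the same list containing a `None`.

```python
import numpy as np
import pandas as pd

data = [10, None, 5]

p = pd.array(data)  # pandas keeps an integer type and marks the gap as missing
n = np.array(data)  # numpy falls back to a generic array of Python objects

print(p.dtype)
print(n.dtype)

# Arithmetic on the pandas array propagates the missing value
print(p + 1)
```

The same operation on `n` would fail, because numpy would try to compute `None + 1`.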


# Data frames

A *data frame* is a Python object that stores a table. It is a collection of columns, and each column is an array of a specific type. Data frames are typically visualized as tables.

In [14]:
df = pd.DataFrame(data={'weekday':s,'units-sold':x,'satisfaction':r})
print(df)

  weekday  units-sold  satisfaction
0     mon          10         12.30
1     tue          12         14.50
2     wed           1         22.95
3     thu           5         30.00
4     fri          67         16.00
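
Since a data frame is a collection of typed columns, we can inspect the column types, pull out a single column, or keep only some rows. A minimal sketch, rebuilding the same data frame so that the cell runs on its own (the names `units` and `busy` are ours):

```python
import pandas as pd

s = pd.array(['mon', 'tue', 'wed', 'thu', 'fri'])
x = pd.array([10, 12, 1, 5, 67])
r = pd.array([12.3, 14.5, 22.95, 30, 16])
df = pd.DataFrame(data={'weekday': s, 'units-sold': x, 'satisfaction': r})

# One data type per column
print(df.dtypes)

# Selecting a column by name returns that column with its values
units = df['units-sold']
print(units.sum())

# Keep only the rows where more than 10 units were sold
busy = df[df['units-sold'] > 10]
print(busy)
```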


We can read a CSV file into a data frame using the `read_csv` function.
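
As a small illustration of `read_csv`, the sketch below reads CSV text from an in-memory buffer instead of a file, so it runs without downloading anything. The content of `csv_text` is a made-up stand-in that reuses a few values from the table above; with a real file we would pass its path instead of the `io.StringIO` object.

```python
import io

import pandas as pd

# Stand-in for the contents of a CSV file: a header line, then one row per line
csv_text = """weekday,units-sold,satisfaction
mon,10,12.3
tue,12,14.5
wed,1,22.95
"""

# read_csv accepts a file path or any file-like object
sales = pd.read_csv(io.StringIO(csv_text))

print(sales)
print(sales.dtypes)
```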

## Exercises about summarizing data

Download the "adult.csv" data set from Kaggle (https://www.kaggle.com/uciml/adult-census-income) and read it into a data frame in your notebook (you can also use other data sets).
1. Check the types of the variables loaded in the data frame.
2. Analyse the variables and summarize them using central and dispersion statistical measures.
3. Calculate the mean of a numerical variable without using a pre-defined function.
4. Plot the proportion of values in categorical variables.
5. If we want to see the frequency of the different values of a categorical variable, what is the most adequate plot? Show examples.
6. How can we describe the distribution of a numerical variable using statistical measures?
7. How can we describe the distribution of a numerical variable using simple plots?
8. Find outliers in the numerical variables.
9. Look for positive and negative correlations between the numerical variables.
10. Plot two variables with high absolute correlation using a scatter plot.

In [11]:
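# Path of the file after uploading it to Colab; adjust it if the file is elsewhere
adult = pd.read_csv('/content/adult.csv')
# read_csv already returns a data frame, so this wrapping is not strictly needed;
# `ad` is the name used in the cells below
ad = pd.DataFrame(data=adult)
print(ad)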

       age workclass  fnlwgt  ... hours.per.week  native.country income
0       90         ?   77053  ...             40   United-States  <=50K
1       82   Private  132870  ...             18   United-States  <=50K
2       66         ?  186061  ...             40   United-States  <=50K
3       54   Private  140359  ...             40   United-States  <=50K
4       41   Private  264663  ...             40   United-States  <=50K
...    ...       ...     ...  ...            ...             ...    ...
32556   22   Private  310152  ...             40   United-States  <=50K
32557   27   Private  257302  ...             38   United-States  <=50K
32558   40   Private  154374  ...             40   United-States   >50K
32559   58   Private  151910  ...             40   United-States  <=50K
32560   22   Private  201490  ...             20   United-States  <=50K

[32561 rows x 15 columns]


### 1) Check the types of the variables loaded in the data frame.

In [15]:
ad.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education.num      int64
marital.status    object
occupation        object
relationship      object
race              object
sex               object
capital.gain       int64
capital.loss       int64
hours.per.week     int64
native.country    object
income            object
dtype: object

### 2) Analyse the variables and summarize them using central and dispersion statistical measures.