# Analysis and Visualization of Complex Agro-Environmental Data
---
## Exercise 02 - Best charts to represent different data or dataset types.

Each type of data or dataset is best visualized by certain kinds of charts, depending on both the target audience and the personal preferences of the data visualizer. In this exercise you will first simulate different types of data and datasets in Python. Randomization is useful for understanding certain statistical concepts and as a basis for random sampling, which may be required when dealing with big data. Based on these simulated data, the exercise consists of choosing the type of chart that you find most adequate to represent each type of data and dataset.

The objectives of this exercise are to:
1. identify the type of each variable and table that is created.

2. try your best to interpret each line of the code provided.

3. based on these simulated data, insert markdown boxes into this notebook stating the type of chart that you find most adequate to represent each type of data and dataset, justifying your choices. You should identify the different axes of the plot, if applicable. Don't forget that drawing sketches might help! You may get some help from this site: https://datavizproject.com/

You will use two modules that provide pseudo-random number generators to implement random sampling routines. Have a look [here](https://docs.python.org/3/library/random.html) (random module) and [here](https://numpy.org/doc/stable/reference/random/index.html) (numpy.random module). Both can simulate data and take random samples, although numpy.random offers a wider range of pseudo-random generators and distributions.
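As a minimal sketch of how the two modules compare, the snippet below draws five land-use labels with replacement using each of them. `random.choices` and NumPy's `default_rng(...).choice` are not used in the cells of this notebook; they are shown only as alternatives, and the shortened `levels` list and the seed value are arbitrary choices made for illustration.

```python
import random
import numpy as np

levels = ["Urban", "Pasture", "Shrubland"]  # shortened list, for illustration only

# Standard library: random.choices samples WITH replacement, so k may exceed len(levels)
random.seed(24)
sample_std = random.choices(levels, k=5)

# random.sample draws WITHOUT replacement, so asking for 5 items out of 3 fails
try:
    random.sample(levels, 5)
    sample_error = None
except ValueError as err:
    sample_error = err

# NumPy: Generator.choice also samples with replacement by default
rng = np.random.default_rng(24)
sample_np = rng.choice(levels, size=5)

print(sample_std)
print(sample_np)
print(sample_error)
```

The `ValueError` raised by `random.sample` is the reason a loop (or a function that samples with replacement) is needed to draw 100 values from only 8 levels.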

To run the simulations, first import the pandas, numpy and random modules. pandas and numpy must be installed beforehand if they are not already; random is part of the Python standard library.


In [1]:
import pandas as pd
import numpy as np
import random

In [2]:
# Simulate var1
var1 = []
random.seed(24) # optional: used to fix the seed of the pseudo-random number generator (use any number of your choice)
levels = ["Permanent crops", "Irrigated crops", "Managed Forest", "Natural Forest", "Agro-Forestry system", "Urban", "Pasture", "Shrubland" ]
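for _ in range(100): # a loop is needed because random.sample draws without replacement, so one call cannot return more items than len(levels)
    var1 += random.sample(levels, 1) # random.sample returns a one-element list; var1.extend(random.sample(levels, 1)) is equivalent, whereas append would nest that list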
print(var1)

['Pasture', 'Managed Forest', 'Natural Forest', 'Managed Forest', 'Natural Forest', 'Managed Forest', 'Irrigated crops', 'Managed Forest', 'Agro-Forestry system', 'Permanent crops', 'Shrubland', 'Shrubland', 'Irrigated crops', 'Permanent crops', 'Managed Forest', 'Shrubland', 'Shrubland', 'Agro-Forestry system', 'Shrubland', 'Irrigated crops', 'Agro-Forestry system', 'Managed Forest', 'Urban', 'Agro-Forestry system', 'Irrigated crops', 'Urban', 'Permanent crops', 'Natural Forest', 'Urban', 'Urban', 'Irrigated crops', 'Agro-Forestry system', 'Irrigated crops', 'Natural Forest', 'Managed Forest', 'Shrubland', 'Agro-Forestry system', 'Agro-Forestry system', 'Agro-Forestry system', 'Natural Forest', 'Irrigated crops', 'Shrubland', 'Managed Forest', 'Managed Forest', 'Urban', 'Shrubland', 'Natural Forest', 'Managed Forest', 'Shrubland', 'Agro-Forestry system', 'Natural Forest', 'Natural Forest', 'Irrigated crops', 'Managed Forest', 'Agro-Forestry system', 'Urban', 'Natural Forest', 'Pasture

In [3]:
# alternative to run a random sampling with replacement (using numpy)
levels = np.array(["Permanent crops", "Irrigated crops", "Managed Forest", "Natural Forest", "Agro-Forestry system", "Urban", "Pasture", "Shrubland"])
sampler = np.random.randint(0, len(levels), 100) # 100 random integers from 0 to 7 (the upper bound, len(levels), is excluded)
var1 = levels.take(sampler) # use sampler as indices into "levels"; take returns the elements of the array at the given indices
print(var1)

['Pasture' 'Managed Forest' 'Urban' 'Managed Forest' 'Managed Forest'
 'Permanent crops' 'Urban' 'Agro-Forestry system' 'Pasture'
 'Irrigated crops' 'Irrigated crops' 'Agro-Forestry system'
 'Natural Forest' 'Shrubland' 'Permanent crops' 'Urban' 'Permanent crops'
 'Pasture' 'Irrigated crops' 'Irrigated crops' 'Managed Forest' 'Pasture'
 'Natural Forest' 'Shrubland' 'Urban' 'Managed Forest' 'Shrubland'
 'Agro-Forestry system' 'Irrigated crops' 'Agro-Forestry system' 'Pasture'
 'Pasture' 'Pasture' 'Pasture' 'Pasture' 'Permanent crops' 'Pasture'
 'Natural Forest' 'Urban' 'Irrigated crops' 'Pasture' 'Pasture' 'Urban'
 'Natural Forest' 'Managed Forest' 'Managed Forest' 'Irrigated crops'
 'Shrubland' 'Shrubland' 'Shrubland' 'Natural Forest' 'Shrubland'
 'Permanent crops' 'Urban' 'Irrigated crops' 'Urban' 'Urban'
 'Natural Forest' 'Managed Forest' 'Permanent crops' 'Shrubland'
 'Permanent crops' 'Irrigated crops' 'Natural Forest' 'Urban' 'Shrubland'
 'Urban' 'Irrigated crops' 'Urban' 'Shrubla

In [4]:
sampler = np.random.randint(0, len(levels), 100)
type(sampler)

numpy.ndarray

In [5]:
# Simulate var2
np.random.seed(24) # optional: used to fix the seed of the pseudo-random number generator (use any number of your choice)
var2 = np.random.uniform(0, 100, 100)
print(var2)

[96.00173033 69.95120499 99.98672926 22.00672998 36.1056354  73.98409902
 99.64557251 31.63469778 13.65445798 38.39800102 32.05192836 36.64147531
 70.96515626 90.01424305 53.41154392 24.72937649 67.18065626 56.17291073
 54.25598767 89.34476037 84.27795496 30.60125899 63.11697775 68.02388604
 97.04275604 89.35671519 94.24258614 64.22254812 61.46476338 22.76832544
 48.6031869  80.72192994 84.42201535 53.46808662 75.77980499 49.96768861
 85.03278966 61.96967754 86.16141791 23.16971966 40.22184146 62.43750622
 14.30367059 12.27984836 41.68299108 55.68829821 94.14191754 40.92590225
 73.67514494 99.54506744 91.66643492  0.20232726 97.13316932 88.90481767
 69.94886062  9.75246685 57.34290389 82.00371163 56.08910506 35.07624607
 54.34997561 87.95890917 11.40965649  3.14388054 95.28100604 28.87434744
 44.19491709 25.90215323 59.68914437 65.5286046  27.56954606 85.79724579
 88.87241464 28.50605911 65.95604191 97.21202594 79.68741126 17.94644012
 78.46729779 97.01278886 36.2811769   8.78860648 34

In [6]:
# Simulate table1
table1 = pd.DataFrame(var1).value_counts(sort=True)
table1 = table1.rename_axis("landuse")
table1 = table1.reset_index(name="Frequency")
print(table1)

                landuse  Frequency
0               Pasture         16
1                 Urban         15
2        Natural Forest         14
3             Shrubland         13
4       Irrigated crops         12
5       Permanent crops         12
6        Managed Forest         10
7  Agro-Forestry system          8


### This table is best represented by a vertical bar chart: it has 1 categorical variable (landuse) and 1 discrete numerical variable (Frequency)
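One possible way to draw such a chart is sketched below with matplotlib, which is an assumption here because the notebook itself only imports pandas, numpy and random. The frequencies are copied from the printed `table1` rather than re-simulated, and the axis labels are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Frequencies copied from the table1 output above
table1 = pd.DataFrame({
    "landuse": ["Pasture", "Urban", "Natural Forest", "Shrubland",
                "Irrigated crops", "Permanent crops", "Managed Forest",
                "Agro-Forestry system"],
    "Frequency": [16, 15, 14, 13, 12, 12, 10, 8],
})

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(table1["landuse"], table1["Frequency"])  # x axis: category, y axis: count
ax.set_xlabel("Land use")
ax.set_ylabel("Frequency")
ax.tick_params(axis="x", labelrotation=45)  # long category labels overlap otherwise
fig.tight_layout()
```

Because `value_counts(sort=True)` already orders the categories by frequency, the bars appear from the most to the least frequent class.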

In [7]:
# Simulate table2
table2 = pd.DataFrame(list(zip(var1, var2)), columns = ["landuse", "cover"])
print(table2)

            landuse      cover
0           Pasture  96.001730
1    Managed Forest  69.951205
2             Urban  99.986729
3    Managed Forest  22.006730
4    Managed Forest  36.105635
..              ...        ...
95          Pasture  27.560264
96   Natural Forest  60.397982
97  Irrigated crops  54.597285
98   Managed Forest  20.978981
99   Natural Forest  13.612275

[100 rows x 2 columns]


### This table is best represented by a box plot: it has 1 categorical variable (landuse) and 1 continuous numerical variable (cover)
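A box plot summarises, for each land-use class, the median, quartiles and range of the cover values. As a hedged sketch, the same five-number summary can be computed with pandas before drawing anything. The small `table2_demo` frame below is made-up data with the same two columns as the table above, not the simulated table itself.

```python
import pandas as pd

# Hypothetical data: one categorical and one continuous column
table2_demo = pd.DataFrame({
    "landuse": ["Urban", "Urban", "Urban", "Pasture", "Pasture", "Pasture"],
    "cover": [10.0, 20.0, 60.0, 30.0, 50.0, 40.0],
})

# One row per land-use class, one column per statistic shown by a box plot
summary = (
    table2_demo.groupby("landuse")["cover"]
    .quantile([0.0, 0.25, 0.5, 0.75, 1.0])
    .unstack()
)
summary.columns = ["min", "q1", "median", "q3", "max"]
print(summary)
```

If matplotlib is installed, `table2_demo.boxplot(column="cover", by="landuse")` should draw the corresponding plot, with the categorical variable on the x axis and the continuous one on the y axis.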

Note: The zip() function returns a zip object, which is an iterator of tuples: the first items of all the passed iterables are paired together, then the second items, and so on. Wrapping it in tuple() consumes the iterator and gives a result that can be printed - try running: print(tuple(zip(var1,var2)))
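The behaviour described in this note can be checked on a small example. The two short lists below are made up for illustration; one is deliberately shorter than the other to show that zip() stops at the shortest input.

```python
names = ["Urban", "Pasture", "Shrubland"]
values = [12.5, 40.0]  # deliberately one element shorter than names

pairs = zip(names, values)   # lazy iterator: nothing is paired yet
as_tuple = tuple(pairs)      # consume the iterator so the result can be printed
print(as_tuple)              # (('Urban', 12.5), ('Pasture', 40.0))

# The iterator is now exhausted, so consuming it again yields nothing
print(tuple(pairs))          # ()
```

Note that "Shrubland" is dropped without any warning, so it is worth checking that all inputs have the same length before zipping them into a DataFrame.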

In [8]:
print(tuple(zip(var1,var2)))

((np.str_('Pasture'), np.float64(96.00173033359185)), (np.str_('Managed Forest'), np.float64(69.9512049949576)), (np.str_('Urban'), np.float64(99.98672926238793)), (np.str_('Managed Forest'), np.float64(22.00672997828518)), (np.str_('Managed Forest'), np.float64(36.1056353964058)), (np.str_('Permanent crops'), np.float64(73.9840990209437)), (np.str_('Urban'), np.float64(99.6455725089097)), (np.str_('Agro-Forestry system'), np.float64(31.63469777906084)), (np.str_('Pasture'), np.float64(13.654457982352463)), (np.str_('Irrigated crops'), np.float64(38.39800101516001)), (np.str_('Irrigated crops'), np.float64(32.05192835651931)), (np.str_('Agro-Forestry system'), np.float64(36.64147530835151)), (np.str_('Natural Forest'), np.float64(70.96515625881274)), (np.str_('Shrubland'), np.float64(90.01424305233735)), (np.str_('Permanent crops'), np.float64(53.41154391977205)), (np.str_('Urban'), np.float64(24.729376490994515)), (np.str_('Permanent crops'), np.float64(67.18065625770753)), (np.str_('

In [9]:
# Simulate table3
np.random.seed(24) # optional: used to fix the seed of the pseudo-random number generator (use any number of your choice)
year = list(range(1970,2021))
temp = np.random.normal(17,2,51)
table3 = pd.DataFrame(list(zip(year, temp)), columns = ["Year", "Temperature"])
print(table3)

    Year  Temperature
0   1970    19.658424
1   1971    15.459933
2   1972    16.367439
3   1973    15.018379
4   1974    14.858367
5   1975    14.122573
6   1976    18.128834
7   1977    17.591444
8   1978    13.747192
9   1979    17.439130
10  1980    18.357610
11  1981    20.778545
12  1982    18.923077
13  1983    17.208022
14  1984    16.037669
15  1985    18.700457
16  1986    19.906849
17  1987    19.115475
18  1988    17.331123
19  1989    18.030037
20  1990    14.326129
21  1991    18.125722
22  1992    19.785710
23  1993    16.873344
24  1994    17.243337
25  1995    19.415205
26  1996    16.995920
27  1997    20.255591
28  1998    17.708986
29  1999    19.075055
30  2000    16.228633
31  2001    18.039636
32  2002    20.373166
33  2003    14.348074
34  2004    19.857967
35  2005    12.821291
36  2006    16.740360
37  2007    18.263046
38  2008    15.826924
39  2009    17.581440
40  2010    19.528207
41  2011    17.580070
42  2012    13.059423
43  2013    18.607812
44  2014  

### This table is best represented by a line plot: it has 1 time variable (Year) and 1 continuous numerical variable (Temperature)
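A possible line plot for this table is sketched below. It assumes matplotlib is available (the notebook does not import it) and rebuilds the table with the same seed and parameters as the cell above so that the block is self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same simulation as the table3 cell
np.random.seed(24)
year = list(range(1970, 2021))
temp = np.random.normal(17, 2, 51)
table3 = pd.DataFrame({"Year": year, "Temperature": temp})

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(table3["Year"], table3["Temperature"], marker="o")  # x axis: time, y axis: value
ax.set_xlabel("Year")
ax.set_ylabel("Temperature")
fig.tight_layout()
```

Placing time on the x axis and connecting consecutive years makes the year-to-year variation easy to follow.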

In [10]:
# Simulate table4
xx = np.array([16,21])
yy = np.array([300, 1200])
means = [xx.mean(), yy.mean()]  
stds = [xx.std() / 3, yy.std() / 3]
corr = -0.7 # correlation
covs = [[stds[0]**2          , stds[0]*stds[1]*corr], 
        [stds[0]*stds[1]*corr,           stds[1]**2]] # covariance matrix
table4 = pd.DataFrame(np.random.multivariate_normal(means, covs, 100), columns = ["Mean Annual Temperature", "Total Precipitation"])
print(table4)

    Mean Annual Temperature  Total Precipitation
0                 18.294961           909.201074
1                 18.556194           684.600944
2                 18.213491           840.320436
3                 18.157810           755.513792
4                 17.785119           824.875035
..                      ...                  ...
95                18.941036           608.551015
96                18.339957           645.584342
97                18.835738           610.889186
98                18.311114           787.298732
99                18.915846           691.382160

[100 rows x 2 columns]


### This table is best represented by a scatter plot: it has 2 continuous numerical variables (mean annual temperature and total precipitation)
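The negative relationship that a scatter plot would reveal comes from the covariance matrix built in the cell above, whose off-diagonal entries are the product of the correlation and the two standard deviations. As a quick check, the sketch below rebuilds the same matrix, draws a larger sample and compares the sample correlation with the target of -0.7. The sample size of 10,000 and the use of `np.random.default_rng` are choices made for this illustration, not part of the original cell.

```python
import numpy as np

xx = np.array([16, 21])
yy = np.array([300, 1200])
means = [xx.mean(), yy.mean()]        # 18.5 and 750.0
stds = [xx.std() / 3, yy.std() / 3]   # roughly 0.83 and 150.0
corr = -0.7                           # target correlation

# Off-diagonal covariance = correlation * std_x * std_y
cov_xy = corr * stds[0] * stds[1]
covs = [[stds[0] ** 2, cov_xy],
        [cov_xy, stds[1] ** 2]]

rng = np.random.default_rng(24)
sample = rng.multivariate_normal(means, covs, size=10_000)

# Pearson correlation between the two simulated columns
sample_corr = np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]
print(round(sample_corr, 2))
```

The printed value should be close to -0.7; with only 100 rows, as in `table4`, the sample correlation can deviate more noticeably from the target.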

In [11]:
# Simulate table5
col1 = pd.Series(list(range(1900,2010,10))).repeat(8)
col2 = ["Permanent crops", "Irrigated crops", "Managed Forest", "Natural Forest", "Agro-Forestry system", "Urban", "Pasture", "Shrubland" ]*11
col3 = np.random.uniform(0, 100, len(col2)) # one cover value per Year/Landuse row (11 decades x 8 classes = 88)
table5 = pd.DataFrame(list(zip(col1, col2, col3)), columns = ["Year", "Landuse", "Cover"])
print(table5)

    Year               Landuse      Cover
0   1900       Permanent crops  77.543675
1   1900       Irrigated crops  38.634305
2   1900        Managed Forest  36.945386
3   1900        Natural Forest  69.217019
4   1900  Agro-Forestry system   4.370542
..   ...                   ...        ...
83  2000        Natural Forest  32.942488
84  2000  Agro-Forestry system  68.643865
85  2000                 Urban  34.579609
86  2000               Pasture  45.485347
87  2000             Shrubland  53.094214

[88 rows x 3 columns]


### This table is best represented by a stacked area chart: it has 1 time variable (Year), 1 categorical variable (Landuse) and 1 continuous numerical variable (Cover)
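Most plotting tools expect one column per category when drawing a stacked area chart, so a long table like this one usually has to be reshaped first. The sketch below shows that step on a smaller, hypothetical table with the same three columns. The shortened class list, the seed and the names `long` and `wide` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

years = list(range(1900, 2010, 10))          # 11 decades, as in table5
classes = ["Urban", "Pasture", "Shrubland"]  # shortened list, for illustration only
rng = np.random.default_rng(24)

# Long format: one row per Year/Landuse combination
long = pd.DataFrame({
    "Year": np.repeat(years, len(classes)),
    "Landuse": classes * len(years),
    "Cover": rng.uniform(0, 100, len(years) * len(classes)),
})

# Wide format: years as rows, one column per land-use class
wide = long.pivot(index="Year", columns="Landuse", values="Cover")
print(wide.shape)  # (11, 3)
```

If matplotlib is installed, `wide.plot.area()` should then stack the classes on top of each other, with Year on the x axis and cumulative Cover on the y axis.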