# Big Data Real-Time Analytics with Python and Spark

## Chapter 12 - Apache Spark SQL Part 1

### Lab 3 - Data manipulation with SparkSQL, PandasSQL, SQLAlchemy, PostgreSQL and Docker

![Lab4.png](attachment:Lab4.png)

In [1]:
# Python version
from platform import python_version
print('The version used in this notebook is: ', python_version())

The version used in this notebook is:  3.8.13


In [2]:
# https://pypi.org/project/findspark/
!pip install -q findspark

In [3]:
# Import findspark and initialize it
import findspark
findspark.init()

In [4]:
# https://pypi.org/project/psycopg2/
!pip install -q psycopg2

In [5]:
# https://pypi.org/project/psycopg2-binary/
!pip install -q psycopg2-binary

In [6]:
# https://www.sqlalchemy.org/
!pip install -q sqlalchemy

In [7]:
# https://pypi.org/project/pandasql/
!pip install -q pandasql

In [8]:
# Imports
import psycopg2  # PostgreSQL connection driver
import pandasql
import sqlalchemy
import pandas as pd
from pandasql import sqldf  # runs SQL queries against pandas DataFrames
from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT  # constant used to set the isolation level
from sqlalchemy import create_engine  # creates the connection engine
from pyspark.sql import SparkSession 
from pyspark.sql.functions import udf
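
The `create_engine` import above expects a database URL. As a minimal sketch of how such a URL could be assembled for PostgreSQL with the psycopg2 driver: the helper name `build_postgres_url`, the user, password, host, port and database values are all hypothetical placeholders, not the lab's actual settings.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host="localhost", port=5432, database="postgres"):
    """Return a SQLAlchemy-style URL for PostgreSQL using the psycopg2 driver."""
    # quote_plus escapes characters such as '@' or ':' that would otherwise break the URL
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials, for illustration only
url = build_postgres_url("lab_user", "p@ss:word", database="lab_db")
print(url)

# With a reachable PostgreSQL server, the URL would then be passed on:
# engine = create_engine(url)
```

Escaping the credentials keeps passwords with special characters from being misread as part of the host portion of the URL.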

In [10]:
# Package versions used in this notebook
%reload_ext watermark
%watermark -a 'Bianca Amorim' --iversions

Author: Bianca Amorim

pandasql  : 0.7.3
findspark : 2.0.1
pandas    : 1.4.2
sqlalchemy: 1.4.39
psycopg2  : 2.9.5



## Loading Data with Pandas

In [11]:
# Loading dataset 1 with the name of players
df1 = pd.read_csv('datasets/dataset1.csv', index_col = False)

In [12]:
# Shape
df1.shape

(17588, 2)

In [13]:
# Type of data
df1.dtypes

Name    object
url     object
dtype: object

In [14]:
# Visualize the first rows
df1.head()

|   | Name | url |
|---|---|---|
| 0 | Cristiano Ronaldo | /player/20801/cristiano-ronaldo/ |
| 1 | Lionel Messi | /player/158023/lionel-messi/ |
| 2 | Neymar | /player/190871/neymar/ |
| 3 | Luis Suárez | /player/176580/luis-su%C3%A1rez/ |
| 4 | Manuel Neuer | /player/167495/manuel-neuer/ |


In [22]:
# Loading dataset 2 with the names of clubs
df2 = pd.read_csv('datasets/dataset2.csv', index_col = False)

In [23]:
# Shape
df2.shape

(633, 2)

In [24]:
# Data type
df2.dtypes

Name    object
url     object
dtype: object

In [25]:
# Visualize the first rows
df2.head()

|   | Name | url |
|---|---|---|
| 0 | FC Bayern | /team/21/fc-bayern/ |
| 1 | Real Madrid | /team/243/real-madrid/ |
| 2 | FC Barcelona | /team/241/fc-barcelona/ |
| 3 | Juventus | /team/45/juventus/ |
| 4 | Manchester Utd | /team/11/manchester-utd/ |


In [26]:
# Loading dataset 3 with the names of national teams
df3 = pd.read_csv('datasets/dataset3.csv', index_col = False)

In [27]:
# Shape
df3.shape

(47, 2)

In [29]:
# data type
df3.dtypes

Name    object
url     object
dtype: object

In [30]:
# Visualize the first rows
df3.head()

|   | Name | url |
|---|---|---|
| 0 | Spain | /team/1362/spain/ |
| 1 | Germany | /team/1337/germany/ |
| 2 | Brazil | /team/1370/brazil/ |
| 3 | Belgium | /team/1325/belgium/ |
| 4 | Argentina | /team/1369/argentina/ |


In [31]:
# Loading dataset 4 with the player attributes
df4 = pd.read_csv('datasets/dataset4.csv', index_col = False)

In [32]:
# Shape
df4.shape

(17588, 53)
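
All four datasets are read with the same `pd.read_csv` call, so the repeated cells could be folded into one helper. The sketch below is only an illustration: `load_csvs` is a hypothetical name, and the demo writes a one-row stand-in file to a temporary folder instead of reading the lab's `datasets` directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csvs(folder, names):
    """Read <folder>/<name>.csv for every name and return a dict of DataFrames."""
    return {
        name: pd.read_csv(Path(folder) / f"{name}.csv", index_col=False)
        for name in names
    }


# Demo with a throwaway folder; in the lab the folder would be 'datasets'
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "dataset3.csv").write_text(
        "Name,url\nSpain,/team/1362/spain/\n", encoding="utf-8"
    )
    frames = load_csvs(tmp, ["dataset3"])

# Print the shape of every loaded DataFrame
for name, frame in frames.items():
    print(name, frame.shape)
```

Keeping the frames in a dictionary makes it easy to loop over them for the shape and dtype checks instead of repeating each cell per dataset.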

In [33]:
# data type
df4.dtypes

Name                   object
Nationality            object
National_Position      object
National_Kit          float64
Club                   object
Club_Position          object
Club_Kit              float64
Club_Joining           object
Contract_Expiry       float64
Rating                  int64
Height                 object
Weight                 object
Preffered_Foot         object
Birth_Date             object
Age                     int64
Preffered_Position     object
Work_Rate              object
Weak_foot               int64
Skill_Moves             int64
Ball_Control            int64
Dribbling               int64
Marking                 int64
Sliding_Tackle          int64
Standing_Tackle         int64
Aggression              int64
Reactions               int64
Attacking_Position      int64
Interceptions           int64
Vision                  int64
Composure               int64
Crossing                int64
Short_Pass              int64
Long_Pass               int64
Accelerati

In [34]:
# Visualize the first rows
df4.head()

|   | Name | Nationality | National_Position | National_Kit | Club | Club_Position | Club_Kit | Club_Joining | Contract_Expiry | Rating | ... | Long_Shots | Curve | Freekick_Accuracy | Penalties | Volleys | GK_Positioning | GK_Diving | GK_Kicking | GK_Handling | GK_Reflexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cristiano Ronaldo | Portugal | LS | 7.0 | Real Madrid | LW | 7.0 | 07/01/2009 | 2021.0 | 94 | ... | 90 | 81 | 76 | 85 | 88 | 14 | 7 | 15 | 11 | 11 |
| 1 | Lionel Messi | Argentina | RW | 10.0 | FC Barcelona | RW | 10.0 | 07/01/2004 | 2018.0 | 93 | ... | 88 | 89 | 90 | 74 | 85 | 14 | 6 | 15 | 11 | 8 |
| 2 | Neymar | Brazil | LW | 10.0 | FC Barcelona | LW | 11.0 | 07/01/2013 | 2021.0 | 92 | ... | 77 | 79 | 84 | 81 | 83 | 15 | 9 | 15 | 9 | 11 |
| 3 | Luis Suárez | Uruguay | LS | 9.0 | FC Barcelona | ST | 9.0 | 07/11/2014 | 2021.0 | 92 | ... | 86 | 86 | 84 | 85 | 88 | 33 | 27 | 31 | 25 | 37 |
| 4 | Manuel Neuer | Germany | GK | 1.0 | FC Bayern | GK | 1.0 | 07/01/2011 | 2021.0 | 92 | ... | 16 | 14 | 11 | 47 | 11 | 91 | 89 | 95 | 90 | 89 |
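
`Club_Joining` is stored as `object`, with values such as `07/01/2009`. A minimal sketch of converting it to a datetime column follows, assuming the strings are month/day/year (an inference from the rows shown, not something the dataset documents). The two-row `sample` frame is a stand-in for `df4`.

```python
import pandas as pd

# Two rows copied from the head() output, standing in for df4
sample = pd.DataFrame({
    "Name": ["Cristiano Ronaldo", "Luis Suárez"],
    "Club_Joining": ["07/01/2009", "07/11/2014"],
})

# Assumed format: month/day/year
sample["Club_Joining"] = pd.to_datetime(sample["Club_Joining"], format="%m/%d/%Y")

print(sample.dtypes)
print(sample["Club_Joining"].dt.year.tolist())
```

Passing an explicit `format` avoids ambiguous day/month guessing and raises an error if a value does not match.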


In [35]:
# Summary statistics of numerical variables
df4.describe()

|   | National_Kit | Club_Kit | Contract_Expiry | Rating | Age | Weak_foot | Skill_Moves | Ball_Control | Dribbling | Marking | ... | Long_Shots | Curve | Freekick_Accuracy | Penalties | Volleys | GK_Positioning | GK_Diving | GK_Kicking | GK_Handling | GK_Reflexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1075.0 | 17587.0 | 17587.0 | 17588.0 | 17588.0 | 17588.0 | 17588.0 | 17588.0 | 17588.0 | 17588.0 | ... | 17588.0 | 17588.0 | 17588.0 | 17588.0 | 17588.0 | 17588.0 | 17588.0 | 17588.0 | 17588.0 | 17588.0 |
| mean | 12.219535 | 21.294536 | 2018.899414 | 66.166193 | 25.460314 | 2.934103 | 2.303161 | 57.972766 | 54.802877 | 44.230327 | ... | 47.403173 | 47.181146 | 43.383443 | 49.165738 | 43.275586 | 16.60962 | 16.823061 | 16.458324 | 16.559814 | 16.901183 |
| std | 6.933187 | 19.163741 | 1.698787 | 7.083012 | 4.680217 | 0.655927 | 0.746156 | 16.834779 | 18.913857 | 21.561703 | ... | 19.211887 | 18.464396 | 17.701903 | 15.871735 | 17.710839 | 17.139904 | 17.798052 | 16.600741 | 16.967256 | 18.034485 |
| min | 1.0 | 1.0 | 2017.0 | 45.0 | 17.0 | 1.0 | 1.0 | 5.0 | 4.0 | 3.0 | ... | 4.0 | 6.0 | 4.0 | 7.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 6.0 | 9.0 | 2017.0 | 62.0 | 22.0 | 3.0 | 2.0 | 53.0 | 47.0 | 22.0 | ... | 32.0 | 34.0 | 31.0 | 39.0 | 30.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 |
| 50% | 12.0 | 18.0 | 2019.0 | 66.0 | 25.0 | 3.0 | 2.0 | 63.0 | 60.0 | 48.0 | ... | 52.0 | 48.0 | 42.0 | 50.0 | 44.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 |
| 75% | 18.0 | 27.0 | 2020.0 | 71.0 | 29.0 | 3.0 | 3.0 | 69.0 | 68.0 | 64.0 | ... | 63.0 | 62.0 | 57.0 | 61.0 | 57.0 | 14.0 | 14.0 | 14.0 | 14.0 | 14.0 |
| max | 36.0 | 99.0 | 2023.0 | 94.0 | 47.0 | 5.0 | 5.0 | 95.0 | 97.0 | 92.0 | ... | 91.0 | 92.0 | 93.0 | 96.0 | 93.0 | 91.0 | 89.0 | 95.0 | 91.0 | 90.0 |
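
The `count` row reports non-null values only, so a count below the 17588 rows signals missing data: `National_Kit` has 1075 values (16513 missing), while `Club_Kit` and `Contract_Expiry` each lack one. As a minimal sketch, the missing values per column could be listed as below. The four-row `sample` frame is made up for illustration and stands in for `df4`.

```python
import pandas as pd

# Made-up rows standing in for df4; None marks a missing value
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "National_Kit": [7.0, None, None, 1.0],
    "Club_Kit": [7.0, 10.0, None, 1.0],
    "Rating": [94, 93, 92, 92],
})

# Count missing values per column and keep only the columns that have any
missing = sample.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Running the same two lines on `df4` would also cover the non-numeric columns, which `describe()` leaves out by default.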
