<a href="https://colab.research.google.com/github/arbakaydemir/PySparkCommands/blob/main/Third_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

### Overview of the Dataset

The `dataset.json` file contains comprehensive data on the population statistics of various countries around the world. This dataset includes key demographic and socio-economic indicators for each country, providing valuable insights into global population trends.

### Dataset Description

The dataset comprises the following columns:

- **position**: The rank of the country based on its population size.
- **country**: The name of the country.
- **population**: The total population of the country.
- **yearly_change**: The annual percentage change in the population.
- **net_change**: The net change in the population over the past year.
- **density_per_square_km**: The population density, measured as the number of people per square kilometer.
- **land_are_in_square_km**: The total land area of the country in square kilometers. (The column name is misspelled in the dataset itself, so the code below uses it as-is.)
- **migrants_net**: The net number of migrants, indicating the difference between the number of people entering and leaving the country.
- **fertility_rate**: The average number of children born to a woman over her lifetime.
- **median_age**: The median age of the population.
- **urban_population**: The percentage of the population living in urban areas.
- **world_share**: The percentage of the world's population that resides in the country.

### Purpose of the Dataset

This dataset is intended for use in demographic analysis, population studies, and socio-economic research. It can be utilized to:

- Analyze population growth trends and patterns.
- Study the impact of migration on population dynamics.
- Examine the relationship between population density and land area.
- Investigate fertility rates and their implications for future population growth.
- Explore the distribution of urban and rural populations.

# Install and Import Necessary Libraries and PySpark

In [3]:
# Install PySpark in Google Colab
!pip install pyspark

#Import the pySpark
import pyspark



In [4]:
# Import necessary libraries
from pyspark.sql import SparkSession
# Note: the wildcard import shadows Python built-ins such as round, sum, and max
from pyspark.sql.functions import *

#Creating a Spark Session
spark = SparkSession.builder.appName("Practise_PySpark").getOrCreate()

In [5]:
#Mounting google drive to access dataset
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Load a JSON File

In [6]:
#Loading the dataset
file_path = "/content/gdrive/MyDrive/Colab Notebooks/Third Project/dataset.json"

# Read the json file into a DataFrame
df = spark.read.option("multiline","true").json(file_path)

In [7]:
# Display the DataFrame
df.show()

+-------------+---------------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|      country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|   population|position|urban_population|world_share|yearly_change|
+-------------+---------------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|        China|                  153|           1.7|            9,388,211|        38|    -348,399| 5,540,090|1,439,323,776|       1|            61 %|    18.47 %|       0.39 %|
|        India|                  464|           2.2|            2,973,190|        28|    -532,687|13,586,631|1,380,004,385|       2|            35 %|    17.70 %|       0.99 %|
|United States|                   36|           1.8|            9,147,420|        38|     954,806| 1,937,734|  331,002,6

In [57]:
df.count()

235

In [59]:
df.describe().show()

+-------+-----------+---------------------+------------------+---------------------+------------------+-------------------+------------------+----------+-----------------+----------------+-----------+-------------+
|summary|    country|density_per_square_km|    fertility_rate|land_are_in_square_km|        median_age|       migrants_net|        net_change|population|         position|urban_population|world_share|yearly_change|
+-------+-----------+---------------------+------------------+---------------------+------------------+-------------------+------------------+----------+-----------------+----------------+-----------+-------------+
|  count|        235|                  235|               235|                  235|               235|                235|               235|       235|              235|             235|        235|          235|
|   mean|       NULL|   151.39461883408072|2.6930348258706465|    326.2692307692308|30.606965174129353|-144.07407407407408|223.4489795918367

In [61]:
display(df)

DataFrame[country: string, density_per_square_km: string, fertility_rate: string, land_are_in_square_km: string, median_age: string, migrants_net: string, net_change: string, population: string, position: string, urban_population: string, world_share: string, yearly_change: string]

In [62]:
df.distinct().count()

235

In [56]:
df.printSchema()

root
 |-- country: string (nullable = true)
 |-- density_per_square_km: string (nullable = true)
 |-- fertility_rate: string (nullable = true)
 |-- land_are_in_square_km: string (nullable = true)
 |-- median_age: string (nullable = true)
 |-- migrants_net: string (nullable = true)
 |-- net_change: string (nullable = true)
 |-- population: string (nullable = true)
 |-- position: string (nullable = true)
 |-- urban_population: string (nullable = true)
 |-- world_share: string (nullable = true)
 |-- yearly_change: string (nullable = true)



# DataFrame Operations

Drop a Column

In [9]:
df_new = df.drop("density_per_square_km")
df_new.show()

+-------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|      country|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|   population|position|urban_population|world_share|yearly_change|
+-------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|        China|           1.7|            9,388,211|        38|    -348,399| 5,540,090|1,439,323,776|       1|            61 %|    18.47 %|       0.39 %|
|        India|           2.2|            2,973,190|        28|    -532,687|13,586,631|1,380,004,385|       2|            35 %|    17.70 %|       0.99 %|
|United States|           1.8|            9,147,420|        38|     954,806| 1,937,734|  331,002,651|       3|            83 %|     4.25 %|       0.59 %|
|    Indonesia|           2.3|            1,811,570|        30|     -98,955|

Change a column name

In [10]:
df_1 = df.withColumnRenamed("country", "Country")
df_1.show()

+-------------+---------------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|      Country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|   population|position|urban_population|world_share|yearly_change|
+-------------+---------------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|        China|                  153|           1.7|            9,388,211|        38|    -348,399| 5,540,090|1,439,323,776|       1|            61 %|    18.47 %|       0.39 %|
|        India|                  464|           2.2|            2,973,190|        28|    -532,687|13,586,631|1,380,004,385|       2|            35 %|    17.70 %|       0.99 %|
|United States|                   36|           1.8|            9,147,420|        38|     954,806| 1,937,734|  331,002,6

Change multiple column names

In [11]:
df_2 = df.withColumnRenamed("country", "Country").withColumnRenamed(
    "density_per_square_km", "density_squarekm"
)
df_2.show()

+-------------+----------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|      Country|density_squarekm|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|   population|position|urban_population|world_share|yearly_change|
+-------------+----------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|        China|             153|           1.7|            9,388,211|        38|    -348,399| 5,540,090|1,439,323,776|       1|            61 %|    18.47 %|       0.39 %|
|        India|             464|           2.2|            2,973,190|        28|    -532,687|13,586,631|1,380,004,385|       2|            35 %|    17.70 %|       0.99 %|
|United States|              36|           1.8|            9,147,420|        38|     954,806| 1,937,734|  331,002,651|       3|            83 %| 

Rename multiple columns in one call with `pyspark.sql.DataFrame.withColumnsRenamed` (added in Spark 3.4)

In [12]:
df_3 = df.withColumnsRenamed(
    {"country": "Country", "density_per_square_km": "density_squarekm"}
)
df_3.show()

+-------------+----------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|      Country|density_squarekm|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|   population|position|urban_population|world_share|yearly_change|
+-------------+----------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|        China|             153|           1.7|            9,388,211|        38|    -348,399| 5,540,090|1,439,323,776|       1|            61 %|    18.47 %|       0.39 %|
|        India|             464|           2.2|            2,973,190|        28|    -532,687|13,586,631|1,380,004,385|       2|            35 %|    17.70 %|       0.99 %|
|United States|              36|           1.8|            9,147,420|        38|     954,806| 1,937,734|  331,002,651|       3|            83 %| 

# pyspark.sql.DataFrame.select

In [13]:
df.select("country").show()
#This is useful when you'd like to select a single column

+-------------+
|      country|
+-------------+
|        China|
|        India|
|United States|
|    Indonesia|
|     Pakistan|
|       Brazil|
|      Nigeria|
|   Bangladesh|
|       Russia|
|       Mexico|
|        Japan|
|     Ethiopia|
|  Philippines|
|        Egypt|
|      Vietnam|
|     DR Congo|
|       Turkey|
|         Iran|
|      Germany|
|     Thailand|
+-------------+
only showing top 20 rows



In [14]:
df.select('*').show()

+-------------+---------------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|      country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|   population|position|urban_population|world_share|yearly_change|
+-------------+---------------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|        China|                  153|           1.7|            9,388,211|        38|    -348,399| 5,540,090|1,439,323,776|       1|            61 %|    18.47 %|       0.39 %|
|        India|                  464|           2.2|            2,973,190|        28|    -532,687|13,586,631|1,380,004,385|       2|            35 %|    17.70 %|       0.99 %|
|United States|                   36|           1.8|            9,147,420|        38|     954,806| 1,937,734|  331,002,6

In [15]:
df.select(["country"]).show()
# Passing a list works too; this form is useful when you'd like to select multiple columns

+-------------+
|      country|
+-------------+
|        China|
|        India|
|United States|
|    Indonesia|
|     Pakistan|
|       Brazil|
|      Nigeria|
|   Bangladesh|
|       Russia|
|       Mexico|
|        Japan|
|     Ethiopia|
|  Philippines|
|        Egypt|
|      Vietnam|
|     DR Congo|
|       Turkey|
|         Iran|
|      Germany|
|     Thailand|
+-------------+
only showing top 20 rows



In [16]:
df.select(["country", "density_per_square_km"]).show()
#Useful when you have a dynamic list of columns.

#Can be easily modified programmatically.

+-------------+---------------------+
|      country|density_per_square_km|
+-------------+---------------------+
|        China|                  153|
|        India|                  464|
|United States|                   36|
|    Indonesia|                  151|
|     Pakistan|                  287|
|       Brazil|                   25|
|      Nigeria|                  226|
|   Bangladesh|                1,265|
|       Russia|                    9|
|       Mexico|                   66|
|        Japan|                  347|
|     Ethiopia|                  115|
|  Philippines|                  368|
|        Egypt|                  103|
|      Vietnam|                  314|
|     DR Congo|                   40|
|       Turkey|                  110|
|         Iran|                   52|
|      Germany|                  240|
|     Thailand|                  137|
+-------------+---------------------+
only showing top 20 rows



In [17]:
df.select("country", "density_per_square_km").show()
#The columns are specified as multiple string arguments.

#Convenient for hardcoding a fixed set of columns.

#Slightly more concise for a small number of columns.

+-------------+---------------------+
|      country|density_per_square_km|
+-------------+---------------------+
|        China|                  153|
|        India|                  464|
|United States|                   36|
|    Indonesia|                  151|
|     Pakistan|                  287|
|       Brazil|                   25|
|      Nigeria|                  226|
|   Bangladesh|                1,265|
|       Russia|                    9|
|       Mexico|                   66|
|        Japan|                  347|
|     Ethiopia|                  115|
|  Philippines|                  368|
|        Egypt|                  103|
|      Vietnam|                  314|
|     DR Congo|                   40|
|       Turkey|                  110|
|         Iran|                   52|
|      Germany|                  240|
|     Thailand|                  137|
+-------------+---------------------+
only showing top 20 rows



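The code below adds 1000 to each value in the `density_per_square_km` column. The column is still a string, so Spark casts it to a number implicitly; a value with a thousands separator, such as Bangladesh's `1,265`, cannot be parsed and comes back as NULL.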

In [18]:
df.select(df["country"],df["density_per_square_km"]+1000).show()

+-------------+------------------------------+
|      country|(density_per_square_km + 1000)|
+-------------+------------------------------+
|        China|                        1153.0|
|        India|                        1464.0|
|United States|                        1036.0|
|    Indonesia|                        1151.0|
|     Pakistan|                        1287.0|
|       Brazil|                        1025.0|
|      Nigeria|                        1226.0|
|   Bangladesh|                          NULL|
|       Russia|                        1009.0|
|       Mexico|                        1066.0|
|        Japan|                        1347.0|
|     Ethiopia|                        1115.0|
|  Philippines|                        1368.0|
|        Egypt|                        1103.0|
|      Vietnam|                        1314.0|
|     DR Congo|                        1040.0|
|       Turkey|                        1110.0|
|         Iran|                        1052.0|
|      German

In [19]:
df.select(df["country"],(df["population"] > 100000000).alias("is_population_greater_than_100M")).show()

+-------------+-------------------------------+
|      country|is_population_greater_than_100M|
+-------------+-------------------------------+
|        China|                           NULL|
|        India|                           NULL|
|United States|                           NULL|
|    Indonesia|                           NULL|
|     Pakistan|                           NULL|
|       Brazil|                           NULL|
|      Nigeria|                           NULL|
|   Bangladesh|                           NULL|
|       Russia|                           NULL|
|       Mexico|                           NULL|
|        Japan|                           NULL|
|     Ethiopia|                           NULL|
|  Philippines|                           NULL|
|        Egypt|                           NULL|
|      Vietnam|                           NULL|
|     DR Congo|                           NULL|
|       Turkey|                           NULL|
|         Iran|                         

In [20]:
df.select(col("country"), (col("population") > 100000000).alias("is_population_greater_than_100M")).show()

+-------------+-------------------------------+
|      country|is_population_greater_than_100M|
+-------------+-------------------------------+
|        China|                           NULL|
|        India|                           NULL|
|United States|                           NULL|
|    Indonesia|                           NULL|
|     Pakistan|                           NULL|
|       Brazil|                           NULL|
|      Nigeria|                           NULL|
|   Bangladesh|                           NULL|
|       Russia|                           NULL|
|       Mexico|                           NULL|
|        Japan|                           NULL|
|     Ethiopia|                           NULL|
|  Philippines|                           NULL|
|        Egypt|                           NULL|
|      Vietnam|                           NULL|
|     DR Congo|                           NULL|
|       Turkey|                           NULL|
|         Iran|                         

In [21]:
df.printSchema()

root
 |-- country: string (nullable = true)
 |-- density_per_square_km: string (nullable = true)
 |-- fertility_rate: string (nullable = true)
 |-- land_are_in_square_km: string (nullable = true)
 |-- median_age: string (nullable = true)
 |-- migrants_net: string (nullable = true)
 |-- net_change: string (nullable = true)
 |-- population: string (nullable = true)
 |-- position: string (nullable = true)
 |-- urban_population: string (nullable = true)
 |-- world_share: string (nullable = true)
 |-- yearly_change: string (nullable = true)
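The NULL results in the comparison cells above follow from this schema: every column is a string, and a value such as `1,439,323,776` cannot be parsed as a number while it still contains commas. The sketch below mimics that behavior in plain Python (no Spark involved). `lenient_int` and `greater_than` are hypothetical helpers, written on the assumption that a failed cast yields NULL rather than an error, as the outputs above show.

```python
def lenient_int(text):
    """Return int(text), or None when the text is not a plain integer.

    Hypothetical stand-in for a cast that yields NULL on failure.
    """
    try:
        return int(text)
    except ValueError:
        return None


def greater_than(text, threshold):
    """Compare like the cells above: a failed cast makes the result None."""
    number = lenient_int(text)
    return None if number is None else number > threshold


raw = "1,439,323,776"             # population as stored in the JSON file
cleaned = raw.replace(",", "")     # drop the thousands separators

print(greater_than(raw, 100_000_000))      # None: the commas block the cast
print(greater_than(cleaned, 100_000_000))  # True once the commas are gone
```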



## Data Cleaning Conditions

**In case of a dot**:
It is advisable to cast strings such as '2.5' to double or float. The dot does not need to be removed before the conversion: it is the decimal point, and keeping it preserves the fractional part of the number.

**In case of a comma**: If the commas in the data represent decimal points, we must replace them with dots before casting the strings to a numerical data type such as float or double.


**In case of a percentage**: First remove the percentage sign, then cast the remaining string to a numerical data type such as float. Finally, divide the result by 100 to express the percentage as a decimal fraction. Check the example below:

`# Remove percentage sign and cast to float`

`df = df.withColumn("value", regexp_replace(col("value"), " %", "").cast("float") / 100)`
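As a quick check of the arithmetic, here is a minimal plain-Python sketch of the same two steps (strip the sign, divide by 100). `parse_percent` is a hypothetical helper, not part of PySpark; it uses `re.sub` where the Spark version uses `regexp_replace`.

```python
import re


def parse_percent(text):
    """Turn a string such as '18.47 %' into the fraction 0.1847."""
    digits = re.sub(r"\s*%", "", text)   # remove the sign and any space before it
    return float(digits) / 100


for sample in ["61 %", "18.47 %", "0.39 %"]:
    print(sample, "->", round(parse_percent(sample), 4))
```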

**In case of a negative number**: No character needs to be removed. The cast understands the leading minus sign, so the string can be converted to an integer directly.

**In case of a negative number that contains a comma:**
1. Comma as Thousands Separator:
If the comma is intended to separate thousands, you can remove the comma and then cast the string to an integer or float.

`# Remove commas and cast to integer`

`df = df.withColumn("value", regexp_replace(col("value"), ",", "").cast("integer"))`

2. Comma as Decimal Separator:
If the comma is meant to be a decimal point, you should replace the comma with a dot before casting it to a float or double.

`# Replace commas with dots and cast to float`

`df = df.withColumn("value", regexp_replace(col("value"), ",", ".").cast("float"))`
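The two interpretations give very different numbers for the same text, so it is worth checking which one applies before cleaning. A minimal plain-Python sketch, with hypothetical helper names and `str.replace` standing in for `regexp_replace`:

```python
def parse_thousands(text):
    """Treat commas as thousands separators: '-348,399' -> -348399."""
    return int(text.replace(",", ""))


def parse_decimal_comma(text):
    """Treat the comma as a decimal point: '-3,5' -> -3.5."""
    return float(text.replace(",", "."))


print(parse_thousands("-348,399"))     # -348399, the minus sign is kept
print(parse_decimal_comma("-3,5"))     # -3.5
print(parse_decimal_comma("1,265"))    # 1.265, wrong if the comma meant thousands
```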

## Data Cleaning Conditions Part 2
**In case of plus sign with commas:**

Remove the plus sign: The plus sign typically indicates that the value is greater than or equal to the specified number.

Remove commas: If the commas are used as thousands separators.

Cast to an appropriate numerical type: Depending on your need, you can cast it to an integer or float.

`# Remove plus sign, remove commas, and cast to integer`

`df = df.withColumn("value", regexp_replace(col("value"), "[,+]", "").cast("integer"))`

**Using different methods to do the same job**

`# Remove plus sign, remove commas, and cast to integer`

`df = df.withColumn("value", regexp_replace(col("value"), "[,+]", "").cast("integer"))`


Alternatively, we can remove every non-digit character:

`df = df.withColumn("value", regexp_replace(col("value"), "[^0-9]", "").cast("integer"))`

[^0-9]:

This regex pattern matches any character that is not a digit (0-9).

The regexp_replace function will remove all non-digit characters from the "value" column.

**Differences and Similarities:**

**Differences:**

Scope of Removal: The first snippet only removes commas and plus signs, whereas the second snippet removes all non-digit characters. This means the second snippet is more general and can handle various types of non-digit characters.

Use Case: The first snippet is suitable for cases where you specifically want to remove commas and plus signs. The second snippet is suitable for cases where you want to remove any non-numeric characters, ensuring that only digits remain.

**Similarities:**

Outcome: For the specific examples given (like "1,000,000+"), both snippets will produce the same cleaned value ("1000000").

Casting: Both snippets cast the cleaned value to an integer using the cast("integer") method.

**When to Use Each Snippet:**

Use Code Snippet 1:

When your data specifically contains commas and plus signs that you want to remove.

Example: "1,000,000+" to "1000000".

Use Code Snippet 2:

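When your data may contain various non-digit characters, and you want to ensure that only numeric digits remain. Note that this also removes decimal points and minus signs, so it is only safe for non-negative whole numbers.

Example: "1,000,000+ USD" to "1000000". By contrast, "3.5%" becomes "35", which changes the value.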
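To see where the two patterns agree and where they diverge, the sketch below applies both to a few sample strings, using Python's `re.sub` in place of `regexp_replace` (for these simple character classes the regex syntax is the same). The function names and sample values are illustrative.

```python
import re


def strip_commas_and_plus(text):
    """Snippet 1: remove only commas and plus signs."""
    return re.sub(r"[,+]", "", text)


def keep_digits_only(text):
    """Snippet 2: remove every character that is not a digit."""
    return re.sub(r"[^0-9]", "", text)


for text in ["1,000,000+", "3.5%", "-348,399"]:
    print(f"{text!r}: {strip_commas_and_plus(text)!r} vs {keep_digits_only(text)!r}")
```

Both functions return `'1000000'` for the first sample. For the other two, the digits-only pattern also deletes the decimal point and the minus sign, giving `'35'` and `'348399'`.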

## Decision of integer, float and double

**Precision:**

Float: A float (single-precision) has a precision of about 7 decimal digits. It occupies 4 bytes (32 bits) of memory.

Double: A double (double-precision) has a precision of about 15-16 decimal digits. It occupies 8 bytes (64 bits) of memory.

**Range:**

Float: Can represent a smaller range of values compared to double. It’s typically used for numerical data that does not require a high degree of precision.

Double: Can represent a larger range of values and is used when precision is more critical.

**Memory Usage:**

Float: Requires less memory (4 bytes), which can make it faster to process on large datasets.

Double: Requires more memory (8 bytes) but offers greater precision and range.
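The precision difference is easy to observe. Python's own `float` is a 64-bit double, so the sketch below uses the standard `struct` module to round-trip a value through 32 bits. `to_float32` is a hypothetical helper for illustration only.

```python
import struct


def to_float32(value):
    """Round a number to the nearest 32-bit float, returned as a Python float."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


india = 1_380_004_385                # population of India from the dataset

print(float(india) == india)         # True: a double holds this integer exactly
print(to_float32(india) == india)    # False: 32 bits cannot hold all ten digits
print(to_float32(india))             # the nearest value a 32-bit float can store
```

This is one reason the population columns are better stored as integers than as floats.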

**When to Use Float:**
Performance: If memory usage and performance are critical (e.g., large datasets or real-time processing) and the precision of 7 decimal digits is sufficient.

Scientific Calculations: Often used in scientific calculations where a rough estimate is acceptable.

**When to Use Double:**
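High Precision: When dealing with scientific calculations or any application where many significant digits matter. Note that a double is still a binary floating-point type and cannot represent every decimal fraction exactly, so for financial data an exact type such as Spark's `DecimalType` is often the safer choice.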

Large Range: When the range of values is significant, such as in large numerical datasets.

**When to use Integer:**
Precision: Exact numerical values without fractional parts.

Use Case: When your data consists of whole numbers.

Example: Population counts, number of items, etc.
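One caveat for whole numbers: Spark's `integer` type is 32 bits wide, so its largest value is 2,147,483,647. Each population in this dataset fits, but a total across countries may not. A minimal plain-Python check (the constant and helper names are illustrative; Spark's 64-bit alternative is the `long` type):

```python
INT32_MAX = 2**31 - 1          # largest value of a 32-bit signed integer


def fits_in_int32(value):
    """True when value can be stored in a 32-bit signed integer."""
    return -2**31 <= value <= INT32_MAX


china = 1_439_323_776
india = 1_380_004_385

print(fits_in_int32(china))            # True
print(fits_in_int32(china + india))    # False: the sum exceeds INT32_MAX
```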

## Data Cleaning Necessities

The following issues need to be cleaned:



*   Commas in the `density_per_square_km`, `land_are_in_square_km`, `migrants_net`, `net_change`, and `population` columns.
*   Percentage signs in the `urban_population`, `world_share`, and `yearly_change` columns.

In [22]:
df.show()

+-------------+---------------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|      country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|   population|position|urban_population|world_share|yearly_change|
+-------------+---------------------+--------------+---------------------+----------+------------+----------+-------------+--------+----------------+-----------+-------------+
|        China|                  153|           1.7|            9,388,211|        38|    -348,399| 5,540,090|1,439,323,776|       1|            61 %|    18.47 %|       0.39 %|
|        India|                  464|           2.2|            2,973,190|        28|    -532,687|13,586,631|1,380,004,385|       2|            35 %|    17.70 %|       0.99 %|
|United States|                   36|           1.8|            9,147,420|        38|     954,806| 1,937,734|  331,002,6

In [23]:
# Start with the first cleaning task: remove the commas from density_per_square_km,
# land_are_in_square_km, migrants_net, net_change, and population, then cast each to integer.
# fertility_rate contains no commas, so it is only cast to float.

df_new1 = df.withColumn("density_per_square_km", regexp_replace(col("density_per_square_km"), ",", "").cast("integer"))\
          .withColumn("land_are_in_square_km", regexp_replace(col("land_are_in_square_km"), ",", "").cast("integer"))\
          .withColumn("migrants_net", regexp_replace(col("migrants_net"), ",", "").cast("integer"))\
          .withColumn("net_change", regexp_replace(col("net_change"), ",", "").cast("integer"))\
          .withColumn("population", regexp_replace(col("population"), ",", "").cast("integer"))\
          .withColumn("fertility_rate", col("fertility_rate").cast("float"))

In [24]:
df_new1.show()
df_new1.printSchema()

+-------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|      country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+-------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|        China|                  153|           1.7|              9388211|        38|     -348399|   5540090|1439323776|       1|            61 %|    18.47 %|       0.39 %|
|        India|                  464|           2.2|              2973190|        28|     -532687|  13586631|1380004385|       2|            35 %|    17.70 %|       0.99 %|
|United States|                   36|           1.8|              9147420|        38|      954806|   1937734| 331002651|       3|      

The first task is completed successfully with the code above. Now it is time for the second one.

Percentage signs in the `urban_population`, `world_share`, and `yearly_change` columns. The dots stay, because they are decimal points.

In [25]:
# Remove the percentage signs and cast the three columns to float.
# Then divide by 100 so each value becomes a decimal fraction,
# e.g. "18.47 %" becomes 0.1847. round() here is pyspark.sql.functions.round
# (from the wildcard import) and trims floating-point noise left by the division.
df_new1 = df_new1.withColumn("urban_population", round(regexp_replace(col("urban_population"), "%", "").cast("float") / 100, 5))\
          .withColumn("world_share", round(regexp_replace(col("world_share"), "%", "").cast("float") / 100, 7))\
          .withColumn("yearly_change", round(regexp_replace(col("yearly_change"), "%", "").cast("float") / 100, 7))
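The `round` calls matter because a 32-bit float cannot store a value like 18.47 exactly, and the small error becomes visible after the division. The plain-Python sketch below reproduces the effect with the standard `struct` module. `to_float32` is a hypothetical helper, and the Spark computation is assumed to behave analogously.

```python
import struct


def to_float32(value):
    """Round a number to the nearest 32-bit float, returned as a Python float."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


raw = to_float32(18.47) / 100     # roughly what cast("float") / 100 produces
print(raw)                        # close to, but not exactly, 0.1847
print(round(raw, 7))              # 0.1847 after rounding to 7 decimals
```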

In [26]:
df_new1.show()
df_new1.printSchema()

+-------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|      country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+-------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|        China|                  153|           1.7|              9388211|        38|     -348399|   5540090|1439323776|       1|            0.61|     0.1847|       0.0039|
|        India|                  464|           2.2|              2973190|        28|     -532687|  13586631|1380004385|       2|            0.35|      0.177|       0.0099|
|United States|                   36|           1.8|              9147420|        38|      954806|   1937734| 331002651|       3|      

## Changing Data Type of a Single Column:

In [27]:
df_new1 = df_new1.withColumn("position", col("position").cast("integer"))

In [28]:
df_new1.show(1)

+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|  China|                  153|           1.7|              9388211|        38|     -348399|   5540090|1439323776|       1|            0.61|     0.1847|       0.0039|
+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
only showing top 1 row



Changing Data Type of Another Column (to convert several columns in one statement, chain `withColumn` calls as in cell 23):

In [29]:
df_new1 = df_new1.withColumn("median_age", col("median_age").cast("integer"))

In [30]:
df_new1.printSchema()

root
 |-- country: string (nullable = true)
 |-- density_per_square_km: integer (nullable = true)
 |-- fertility_rate: float (nullable = true)
 |-- land_are_in_square_km: integer (nullable = true)
 |-- median_age: integer (nullable = true)
 |-- migrants_net: integer (nullable = true)
 |-- net_change: integer (nullable = true)
 |-- population: integer (nullable = true)
 |-- position: integer (nullable = true)
 |-- urban_population: double (nullable = true)
 |-- world_share: double (nullable = true)
 |-- yearly_change: double (nullable = true)



In [31]:
df_new1.select(df_new1["country"],(df_new1["population"] > 100000000).alias("is_population_greater_than_100M")).show()

+-------------+-------------------------------+
|      country|is_population_greater_than_100M|
+-------------+-------------------------------+
|        China|                           true|
|        India|                           true|
|United States|                           true|
|    Indonesia|                           true|
|     Pakistan|                           true|
|       Brazil|                           true|
|      Nigeria|                           true|
|   Bangladesh|                           true|
|       Russia|                           true|
|       Mexico|                           true|
|        Japan|                           true|
|     Ethiopia|                           true|
|  Philippines|                           true|
|        Egypt|                           true|
|      Vietnam|                          false|
|     DR Congo|                          false|
|       Turkey|                          false|
|         Iran|                         

In [32]:
df_new1.select(col("country"), (col("population") > 100000000).alias("is_population_greater_than_100M")).show()

+-------------+-------------------------------+
|      country|is_population_greater_than_100M|
+-------------+-------------------------------+
|        China|                           true|
|        India|                           true|
|United States|                           true|
|    Indonesia|                           true|
|     Pakistan|                           true|
|       Brazil|                           true|
|      Nigeria|                           true|
|   Bangladesh|                           true|
|       Russia|                           true|
|       Mexico|                           true|
|        Japan|                           true|
|     Ethiopia|                           true|
|  Philippines|                           true|
|        Egypt|                           true|
|      Vietnam|                          false|
|     DR Congo|                          false|
|       Turkey|                          false|
|         Iran|                         

In [33]:
from pyspark.sql import functions as F

## When and Otherwise

In [34]:
df_new1.select(
    col("country"),
    # Keep the population where it exceeds 100M; otherwise show a text label.
    F.when(col("population") > 100000000, col("population"))
    .otherwise("Less than 100M")
    .alias("is_population_greater_than_100M")
).show()

+-------------+-------------------------------+
|      country|is_population_greater_than_100M|
+-------------+-------------------------------+
|        China|                     1439323776|
|        India|                     1380004385|
|United States|                      331002651|
|    Indonesia|                      273523615|
|     Pakistan|                      220892340|
|       Brazil|                      212559417|
|      Nigeria|                      206139589|
|   Bangladesh|                      164689383|
|       Russia|                      145934462|
|       Mexico|                      128932753|
|        Japan|                      126476461|
|     Ethiopia|                      114963588|
|  Philippines|                      109581078|
|        Egypt|                      102334404|
|      Vietnam|                 Less than 100M|
|     DR Congo|                 Less than 100M|
|       Turkey|                 Less than 100M|
|         Iran|                 Less tha

## Filter and Select Together

In [35]:
df_filtered = df_new1.filter(col("population") > 100000000).select(
    col("country"),
    col("population").alias("population_greater_than_100M")
)

df_filtered.show()

+-------------+----------------------------+
|      country|population_greater_than_100M|
+-------------+----------------------------+
|        China|                  1439323776|
|        India|                  1380004385|
|United States|                   331002651|
|    Indonesia|                   273523615|
|     Pakistan|                   220892340|
|       Brazil|                   212559417|
|      Nigeria|                   206139589|
|   Bangladesh|                   164689383|
|       Russia|                   145934462|
|       Mexico|                   128932753|
|        Japan|                   126476461|
|     Ethiopia|                   114963588|
|  Philippines|                   109581078|
|        Egypt|                   102334404|
+-------------+----------------------------+



## Startswith - Endswith

In [36]:
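# filter() keeps only the rows whose country starts with "A".
df_startwith = df_new1.filter(col("country").startswith("A"))
df_startwith.show()

# select() instead returns a single boolean column: true where the country starts with "A".
df_startwith2 = df_new1.select(df_new1.country.startswith("A"))
df_startwith2.show()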

+-------------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|            country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+-------------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|          Argentina|                   17|           2.3|              2736690|        32|        4800|    415097|  45195774|      32|            0.93|     0.0058|       0.0093|
|            Algeria|                   18|           3.1|              2381740|        29|      -10000|    797990|  43851044|      33|            0.73|     0.0056|       0.0185|
|        Afghanistan|                   60|           4.6|               652860|        18|      -62920| 

In [37]:
df_filtered = df_new1.filter(col("population") > 100000000)
df_filtered.show()

# where() is an alias for filter(), so this returns the same rows.
df_filtered = df_new1.where(col("population") > 100000000)
df_filtered.show()

+-------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|      country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+-------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|        China|                  153|           1.7|              9388211|        38|     -348399|   5540090|1439323776|       1|            0.61|     0.1847|       0.0039|
|        India|                  464|           2.2|              2973190|        28|     -532687|  13586631|1380004385|       2|            0.35|      0.177|       0.0099|
|United States|                   36|           1.8|              9147420|        38|      954806|   1937734| 331002651|       3|      

## Get the size of a DataFrame

In [38]:
print("{} rows".format(df_new1.count()))
print("{} columns".format(len(df_new1.columns)))

235 rows
12 columns


## Get a DataFrame's number of partitions:

Performance Tuning: Understanding the number of partitions can help you optimize the performance of your Spark jobs. Adjusting the number of partitions can lead to better parallelism and resource utilization.

In [39]:
print("{} partition(s)".format(df_new1.rdd.getNumPartitions()))

1 partition(s)


What is an RDD?

Resilient Distributed Dataset (RDD): RDDs are the fundamental data structure of Apache Spark. They represent an immutable, distributed collection of objects that can be processed in parallel.

Key Features of RDDs:

- **Resilient**: Fault-tolerant, with the ability to recompute partitions that are lost or damaged by node failures.
- **Distributed**: Data is spread across multiple nodes in a cluster, allowing parallel processing.
- **Dataset**: A collection of data elements.

What is a Partition?

Partition: A partition is a logical division of data in an RDD. Each partition is a chunk of data that can be processed independently by a task in Spark.

Key Points:

- **Partitions enable parallelism**: Multiple partitions can be processed simultaneously on different nodes in the cluster.
- **The number of partitions can impact performance**: More partitions can improve load balancing and resource utilization, while fewer partitions reduce the overhead of managing them.

Example:
Imagine you have a large dataset of a billion rows. Instead of processing the entire dataset as a single unit, Spark divides it into smaller partitions, say 1000 partitions. Each partition contains a subset of the data, and Spark processes these partitions in parallel across the cluster, making the computation much faster and more efficient.

Checking the Number of Partitions:
The code `print("{} partition(s)".format(df_new1.rdd.getNumPartitions()))` reports how many partitions make up the RDD underlying the DataFrame `df_new1`.
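The idea of splitting data into partitions and processing them in parallel can be sketched in plain Python. This is only an analogy under stated assumptions, not how Spark is implemented: the helper names `split_into_partitions` and `parallel_sum` are hypothetical, the values are made up, and a thread pool stands in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Split rows into num_partitions contiguous chunks of near-equal size."""
    chunk_size, remainder = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions take one extra row each.
        end = start + chunk_size + (1 if i < remainder else 0)
        partitions.append(rows[start:end])
        start = end
    return partitions


def parallel_sum(rows, num_partitions):
    """Sum each partition in its own task, then combine the partial results."""
    partitions = split_into_partitions(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(sum, partitions))
    return sum(partial_sums)


populations = list(range(1, 1001))  # made-up values standing in for a column
print([len(p) for p in split_into_partitions(populations, 4)])  # [250, 250, 250, 250]
print(parallel_sum(populations, 4))                             # 500500
```

In Spark itself, `df_new1.repartition(4)` returns a new DataFrame spread over four partitions, and `df_new1.coalesce(1)` reduces the partition count without a full shuffle.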

## Get data types of a DataFrame's columns

In [40]:
print(df_new1.dtypes)

[('country', 'string'), ('density_per_square_km', 'int'), ('fertility_rate', 'float'), ('land_are_in_square_km', 'int'), ('median_age', 'int'), ('migrants_net', 'int'), ('net_change', 'int'), ('population', 'int'), ('position', 'int'), ('urban_population', 'double'), ('world_share', 'double'), ('yearly_change', 'double')]


## Fill NULL values in specific columns

In [41]:
df_A = df_new1.fillna({"population": 0})
df_A.select(col("country"), col("population")).show()

+-------------+----------+
|      country|population|
+-------------+----------+
|        China|1439323776|
|        India|1380004385|
|United States| 331002651|
|    Indonesia| 273523615|
|     Pakistan| 220892340|
|       Brazil| 212559417|
|      Nigeria| 206139589|
|   Bangladesh| 164689383|
|       Russia| 145934462|
|       Mexico| 128932753|
|        Japan| 126476461|
|     Ethiopia| 114963588|
|  Philippines| 109581078|
|        Egypt| 102334404|
|      Vietnam|  97338579|
|     DR Congo|  89561403|
|       Turkey|  84339067|
|         Iran|  83992949|
|      Germany|  83783942|
|     Thailand|  69799978|
+-------------+----------+
only showing top 20 rows



## Count the Number of Zeros in a Column

In [42]:
zero_count = df_A.filter(col("population") == 0).count()
print("Number of rows with population = 0:", zero_count)

Number of rows with population = 0: 0


## Fill NULL values with column average

In [43]:
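# Compute the column mean first; avg() ignores NULLs.
# Start from df_new1: df_A already had its NULLs replaced with 0 in In [41],
# so it has nothing left to fill.
mean_population = df_new1.agg(avg("population")).first()[0]
df_A = df_new1.fillna({"population": mean_population})
df_A.show()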

+-------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|      country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+-------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|        China|                  153|           1.7|              9388211|        38|     -348399|   5540090|1439323776|       1|            0.61|     0.1847|       0.0039|
|        India|                  464|           2.2|              2973190|        28|     -532687|  13586631|1380004385|       2|            0.35|      0.177|       0.0099|
|United States|                   36|           1.8|              9147420|        38|      954806|   1937734| 331002651|       3|      
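What mean imputation does can be sketched in plain Python, with `None` standing in for NULL. The function name `fill_with_mean` and the sample values are hypothetical; the point is that the mean is computed only from the known values, and only the missing entries are replaced.

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)  # missing values are excluded from the mean
    return [mean if v is None else v for v in values]


sample = [10, None, 30]  # made-up values, not taken from the dataset
print(fill_with_mean(sample))  # [10, 20.0, 30]
```

If the missing entry were replaced with 0 first, the mean of the same list would drop to 40 / 3, which is why the order of the two fill steps matters.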

## Filter based on a specific column value

In [44]:
df_new1.where(col("country") == "China").show()

df_new1.filter(col("country") == "China").show()

df_new1.filter(col("fertility_rate") > 1.5).show()

+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|  China|                  153|           1.7|              9388211|        38|     -348399|   5540090|1439323776|       1|            0.61|     0.1847|       0.0039|
+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+

+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+------------

## Multiple Columns in Filter

In [45]:
df_new1.filter(((col("population") > 50000000) & (col("fertility_rate") > 5))).show()

+--------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
| country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+--------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
| Nigeria|                  226|           5.4|               910770|        18|      -60000|   5175990| 206139589|       7|            0.52|     0.0264|       0.0258|
|DR Congo|                   40|           6.0|              2267050|        17|       23861|   2770836|  89561403|      16|            0.46|     0.0115|       0.0319|
+--------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------

## Filter based on an IN list

In [46]:
from pyspark.sql.functions import col

df_new1.where(col("country").isin(["Germany", "Turkey"])).show()

+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
| Turkey|                  110|           2.1|               769630|        32|      283922|    909452|  84339067|      17|            0.76|     0.0108|       0.0109|
|Germany|                  240|           1.6|               348560|        46|      543822|    266897|  83783942|      19|            0.76|     0.0107|       0.0032|
+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------

## Filter based on a NOT IN list

In [47]:
df_zor = df_new1.where(~col("country").isin(["China"]))

In [48]:
df_zor.select(col("fertility_rate"),col("country").isin(["China"])).show()

+--------------+--------------------+
|fertility_rate|(country IN (China))|
+--------------+--------------------+
|           2.2|               false|
|           1.8|               false|
|           2.3|               false|
|           3.6|               false|
|           1.7|               false|
|           5.4|               false|
|           2.1|               false|
|           1.8|               false|
|           2.1|               false|
|           1.4|               false|
|           4.3|               false|
|           2.6|               false|
|           3.3|               false|
|           2.1|               false|
|           6.0|               false|
|           2.1|               false|
|           2.2|               false|
|           1.6|               false|
|           1.5|               false|
|           1.8|               false|
+--------------+--------------------+
only showing top 20 rows



## Get DataFrame rows that match a substring

In [49]:
df_new1.where(df_new1.country.contains("Turkey")).show()

+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
| Turkey|                  110|           2.1|               769630|        32|      283922|    909452|  84339067|      17|            0.76|     0.0108|       0.0109|
+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+



## Filter a DataFrame based on a custom substring search

In [50]:
df_new1.where(col("country").like("T%")).show()

+-------------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|            country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+-------------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|             Turkey|                  110|           2.1|               769630|        32|      283922|    909452|  84339067|      17|            0.76|     0.0108|       0.0109|
|           Thailand|                  137|           1.5|               510890|        40|       19444|    174396|  69799978|      20|            0.51|      0.009|       0.0025|
|           Tanzania|                   67|           4.9|               885800|        18|      -40076| 
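In a SQL `LIKE` pattern such as `"T%"`, `%` matches any run of characters (including none) and `_` matches exactly one character. The sketch below mimics that matching in plain Python by translating a pattern into a regular expression. The helper `like_to_regex` is hypothetical, the country list is a hand-picked sample, and escape characters are not handled.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # % matches any run of characters, including none
        elif ch == "_":
            parts.append(".")   # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    return re.compile("".join(parts), re.DOTALL)


countries = ["Turkey", "Thailand", "Tanzania", "Iran", "Iraq", "Peru"]

# fullmatch() requires the whole string to match, as LIKE does.
starts_with_t = [c for c in countries if like_to_regex("T%").fullmatch(c)]
ir_plus_two = [c for c in countries if like_to_regex("Ir__").fullmatch(c)]

print(starts_with_t)  # ['Turkey', 'Thailand', 'Tanzania']
print(ir_plus_two)    # ['Iran', 'Iraq']
```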

## Filter based on a column's length

In [51]:
df_new1.where(length(col("country")) < 5).show()

+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+-------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|   Iran|                   52|           2.2|              1628550|        32|      -55000|   1079043|  83992949|      18|            0.76|     0.0108|        0.013|
|   Iraq|                   93|           3.7|               434320|        21|        7834|    912710|  40222493|      36|            0.73|     0.0052|       0.0232|
|   Peru|                   26|           2.3|              1280000|        31|       99069|    461401|  32971854|      43|            0.79|     0.0042|       0.0142

## Multiple filter conditions

AND (&) Operator: Rows must meet all conditions to be included.

OR (|) Operator: Rows can meet any of the conditions to be included.

In [52]:
df_new1.filter(((col("population") < 500000000) & (col("fertility_rate") < 2))).show()

df_new1.filter(((col("population") < 500000000) | (col("fertility_rate") < 2))).show()

+--------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|       country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+--------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
| United States|                   36|           1.8|              9147420|        38|      954806|   1937734| 331002651|       3|            0.83|     0.0425|       0.0059|
|        Brazil|                   25|           1.7|              8358140|        33|       21200|   1509890| 212559417|       6|            0.88|     0.0273|       0.0072|
|        Russia|                    9|           1.8|             16376870|        40|      182456|     62206| 145934462|       9|

## **Some of the most commonly used operators:**

**Comparison Operators:**

- Equality: `==`
- Not Equal: `!=`
- Greater Than: `>`
- Greater Than or Equal To: `>=`
- Less Than: `<`
- Less Than or Equal To: `<=`

**Logical Operators:**

- AND: `&`
- OR: `|`
- NOT: `~`
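One detail worth illustrating is why each comparison is wrapped in parentheses before being combined with `&` or `|`: in Python these operators bind more tightly than `>`, `<`, and `==`. The sketch below shows the effect with plain integers; the variable names and values are toy assumptions, not columns from the dataset.

```python
population, fertility_rate = 1, 1  # toy integers, not real data

# With parentheses: each comparison is evaluated first, then combined with &.
with_parens = (population > 2) & (fertility_rate < 2)

# Without parentheses: & binds tighter than > and <, so Python reads this as
# population > (2 & fertility_rate) < 2, which is 1 > 0 < 2.
without_parens = population > 2 & fertility_rate < 2

print(with_parens)     # False
print(without_parens)  # True
```

With PySpark columns the unparenthesized form typically raises an error rather than silently returning a different result, but the cause is the same precedence rule.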

## Sort DataFrame by a column and Limit

In [53]:
df_new1.orderBy(col("population").desc()).limit(5).show()

+-------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|      country|density_per_square_km|fertility_rate|land_are_in_square_km|median_age|migrants_net|net_change|population|position|urban_population|world_share|yearly_change|
+-------------+---------------------+--------------+---------------------+----------+------------+----------+----------+--------+----------------+-----------+-------------+
|        China|                  153|           1.7|              9388211|        38|     -348399|   5540090|1439323776|       1|            0.61|     0.1847|       0.0039|
|        India|                  464|           2.2|              2973190|        28|     -532687|  13586631|1380004385|       2|            0.35|      0.177|       0.0099|
|United States|                   36|           1.8|              9147420|        38|      954806|   1937734| 331002651|       3|      

## Check for Duplicates and Count

Use `groupBy` and `count` to identify duplicate values and their counts.

In [54]:
# Group by the column "country" and count the occurrences
duplicates_df = df_new1.groupBy("country").count()

# Filter to show only the duplicates
duplicates_df = duplicates_df.filter(col("count") > 1)

# Display the duplicates and their counts
duplicates_df.show()

+-------+-----+
|country|count|
+-------+-----+
+-------+-----+



## Check for Duplicates in Each Column

Iterate over each column, group by that column, count the occurrences, and filter to show only the values that occur more than once.

In [55]:
from pyspark.sql.functions import col

def check_duplicates(df):
    """Print, for each column of df, the values that occur more than once."""
    for column in df.columns:
        # Group by the column, keep values seen more than once, most frequent first
        duplicates_df1 = df.groupBy(column).count().filter(col("count") > 1).orderBy(col("count").desc())
        print(f"Duplicates in column '{column}':")
        duplicates_df1.show()

# Check for duplicates in each column of df_new1
check_duplicates(df_new1)


Duplicates in column 'country':
+-------+-----+
|country|count|
+-------+-----+
+-------+-----+

Duplicates in column 'density_per_square_km':
+---------------------+-----+
|density_per_square_km|count|
+---------------------+-----+
|                   25|    8|
|                    4|    6|
|                   16|    5|
|                   83|    5|
|                   18|    5|
|                  137|    3|
|                   26|    3|
|                    3|    3|
|                   20|    3|
|                   17|    3|
|                   53|    2|
|                  115|    2|
|                   76|    2|
|                  103|    2|
|                  111|    2|
|                   47|    2|
|                   13|    2|
|                   40|    2|
|                  164|    2|
|                   94|    2|
+---------------------+-----+
only showing top 20 rows

Duplicates in column 'fertility_rate':
+--------------+-----+
|fertility_rate|count|
+--------------+-----+
|  