# Assignment 5: Spark Application (KK3789)

---
## Details

**Use the [Module 9](https://courseworks2.columbia.edu/courses/214510/files/22738546?wrap=1) and [Module 10](https://courseworks2.columbia.edu/courses/214510/files/22738543?wrap=1) class exercises as a reference:**

- Create a new notebook in Google Colab
- Download the [Crunchbase ODM Orgs CSV](https://courseworks2.columbia.edu/courses/214510/files/22738559?wrap=1) file and upload it to the "Files" section in your Colab notebook (the upload may take a few minutes)
- Read the Crunchbase Orgs dataset into a Spark DataFrame



**Implement PySpark code using DataFrames, RDDs or Spark UDFs:**

  1. Find all companies whose name is exactly **two words** (e.g., "Goldman Sachs")
   - print the count of such companies **and show()** only the name and location (city, region, country_code) in the resulting Spark DataFrame
  2. Find all companies located in the state of California:
   - print the count of such companies **and show()** only the name and location (city, region, country_code) in the resulting Spark DataFrame
  3. Add a "Blog" column to the DataFrame with the row entries set to 1 if the "domain" field contains "blogspot.com", and 0 otherwise.
   - show() only the name, location (city, region, country_code) and "Blog" column for companies with the "Blog" field marked as 1
  4. Find all companies with names that are **palindromes** (the name reads the same forward and backward, e.g. "madam") using a Spark UDF:
   - print the count and **show()** only the name and location (city, region, country_code) in the resulting Spark DataFrame 
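
The per-row checks behind tasks 1, 3 and 4 can be sketched in plain Python before wiring them into Spark. The helper names below (`is_two_words`, `blog_flag`, `is_palindrome`) are hypothetical, and the choices to split on whitespace, ignore case, and treat null values as non-matches are assumptions that the assignment text does not specify.

```python
from typing import Optional


def is_two_words(name: Optional[str]) -> bool:
    """True when the name has exactly two whitespace-separated words."""
    if name is None:
        return False
    return len(name.split()) == 2


def blog_flag(domain: Optional[str]) -> int:
    """1 when the domain contains 'blogspot.com', otherwise 0 (nulls count as 0)."""
    if domain is None:
        return 0
    return 1 if "blogspot.com" in domain else 0


def is_palindrome(name: Optional[str]) -> bool:
    """True when the name reads the same forward and backward.

    Assumption: case is ignored, and empty or null names are rejected.
    """
    if not name:
        return False
    normalized = name.lower()
    return normalized == normalized[::-1]
```

In the notebook, a helper such as `is_palindrome` could be wrapped with `udf(is_palindrome, BooleanType())` and used inside a DataFrame `filter`, while tasks 1 to 3 can likely be expressed with built-in column functions alone.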

In [1]:
import os, sys
import urllib
import pandas as pd
import pyspark.sql.functions as F
import re
import unicodedata
from PIL import Image
from tabulate import tabulate
from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql import Window
from pyspark.sql.functions import col, size, split, when, udf, length, lower
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.functions import row_number
from pyspark.sql.types import BooleanType

In [2]:
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

In [3]:
spark = SparkSession \
    .builder \
    .appName("Intro to Apache Spark") \
    .config("spark.cores.max", "4") \
    .config('spark.executor.memory', '8G') \
    .config('spark.driver.maxResultSize', '8g') \
    .config('spark.kryoserializer.buffer.max', '512m') \
    .config("spark.driver.cores", "4") \
    .getOrCreate()

sc = spark.sparkContext
print("Using Apache Spark Version", spark.version)

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/03 21:29:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Using Apache Spark Version 4.0.1


In [112]:
spark.conf.set("spark.sql.debug.maxToStringFields", 1000)
# spark.conf.set("spark.executor.defaultJavaOptions", "-Xmx4g")
# spark.conf.set("spark.driver.defaultJavaOptions", "-Xmx4g")
spark.conf.set("spark.executor.defaultJavaOptions", "-XX:+UseG1GC -XX:MaxGCPauseMillis=100 -Xms4g -Xmx8g")
spark.conf.set("spark.driver.defaultJavaOptions", "-XX:+UseG1GC -XX:MaxGCPauseMillis=100 -Xms4g -Xmx8g")

# Check Data
## Check that data was loaded correctly. 

The sample code from the lecture example returned a different total record count for the Spark DataFrame than for the pandas DataFrame. \
So I compared all the values between the two DataFrames and arrived at the correct code for loading the data into a Spark DataFrame.

I found that when a description field contains a comma, the row is parsed incorrectly, which later shows up as a strange value in the name field or a null uuid. \
So I used additional options to load the data.

```python
# Sample code from the lecture notes.
cb_sdf = spark.read.option("header", "true") \
                   .option("delimiter", ",") \
                   .option("inferSchema", "true") \
                   .csv("crunchbase_odm_orgs (3).csv")
```


```python
# Actual code I used.
cb_sdf = spark.read.option("header", "true") \
                   .option("delimiter", ",") \
                   .option("quote", "\"") \
                   .option("escape", "\"") \
                   .option("multiLine", "true") \
                   .option("inferSchema", "true") \
                   .csv("crunchbase_odm_orgs (3).csv")
```
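
To illustrate why the extra options matter, here is a small sketch that uses Python's standard `csv` module instead of Spark. The two sample records are invented for illustration and do not come from the Crunchbase file. A quoted description containing a comma, a doubled quote, and a line break is a single field of a single record, but naive splitting tears it apart, which resembles the symptom described above.

```python
import csv
import io

# Invented sample data: the second record's description holds a comma,
# an escaped quote ("" inside a quoted field), and an embedded line break.
raw = (
    "uuid,name,short_description\n"
    '1,Acme,"Makes anvils, rockets"\n'
    '2,Bolt,"Fast ""delivery"",\nworldwide"\n'
)

# Naive parsing: split on line breaks, then on commas.
naive_rows = [line.split(",") for line in raw.strip().split("\n")]

# Quote-aware parsing: csv.reader treats "" as an escaped quote and keeps
# commas and line breaks that appear inside quoted fields.
parsed_rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_rows))   # the header plus three fragments
print(len(parsed_rows))  # the header plus two records
print(parsed_rows[2])    # the second record, with its description intact
```

In the Spark reader, the `quote`, `escape`, and `multiLine` options play the corresponding roles: they define the quoting character, how a quote inside a quoted field is escaped, and whether a record may span several lines.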


**I collapsed the block that checks values between the DataFrames below [Hidden]. Expand it if you want to see it.**

In [6]:
# [Hidden] Expand this to see the full code block that compares the pandas DF with the Spark DF

df = pd.read_csv('crunchbase_odm_orgs (3).csv')
print(f"Panda DataFrame row count: {len(df)}")

# Read CSV into PySpark DataFrame
cb_sdf = spark.read.option("header", "true") \
                   .option("delimiter", ",") \
                   .option("quote", "\"") \
                   .option("escape", "\"") \
                   .option("multiLine", "true") \
                   .option("inferSchema", "true") \
                   .csv("crunchbase_odm_orgs (3).csv")

print(f"PySpark DataFrame row count: {cb_sdf.count()}")

# Step 1: Compare column names
pandas_cols = set(df.columns)
spark_cols = set(cb_sdf.columns)
print("\nColumn Comparison:")
print(f"Pandas columns: {pandas_cols}")
print(f"Columns in Pandas but not PySpark: {pandas_cols - spark_cols}")
print(f"Columns in PySpark but not Pandas: {spark_cols - pandas_cols}")

# Step 2: Compare schemas (data types)
print("\nPySpark Schema:")
cb_sdf.printSchema()
print("\nPandas dtypes:")
print(df.dtypes)

common_cols = cb_sdf.columns

# Add index to DataFrame using zipWithIndex
cb_sdf_indexed = cb_sdf.rdd.zipWithIndex().map(
    lambda row_index: Row(**row_index[0].asDict(), row_index=row_index[1])
).toDF()

# Walk through the indexed DataFrame in fixed-size chunks
total_rows = cb_sdf_indexed.count()
start_idx = 0
chunk_size = 5000
mismatches = []

while start_idx < total_rows:
    print(f"Start Index: {start_idx}")
    
    chunk_sdf = cb_sdf_indexed.filter(
        (cb_sdf_indexed.row_index >= start_idx) & (cb_sdf_indexed.row_index < start_idx + chunk_size)
    )
    chunk_sdf_pd = chunk_sdf.drop("row_index").toPandas()

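    for i, row in chunk_sdf_pd.iterrows():
        # Matching pandas row; assumes both readers preserve the file's row order
        pandas_row = df.iloc[start_idx + i]
        for col_name in common_cols:  # col_name avoids shadowing pyspark's col()
            val_pandas = pandas_row[col_name]
            val_spark = row[col_name]  # Spark value, converted via toPandas()
            if pd.isnull(val_pandas) and pd.isnull(val_spark):
                continue
            if val_pandas != val_spark:
                mismatches.append((start_idx + i, col_name, val_pandas, val_spark))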

    start_idx += chunk_size
    

# Print mismatches
if mismatches:
    print(f"\nFound {len(mismatches)} mismatches:")
    for row_index, col_name, val_pandas, val_spark in mismatches[:100]:  # Show only the first 100 for brevity
        print(f"Row {row_index}, Column '{col_name}': Pandas = {val_pandas} | PySpark = {val_spark}")
else:
    print("No mismatches found between the DataFrames.")

25/04/12 17:09:14 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors


Panda DataFrame row count: 1127658


                                                                                

PySpark DataFrame row count: 1127658

Column Comparison:
Pandas columns: {'logo_url', 'country_code', 'twitter_url', 'domain', 'city', 'linkedin_url', 'facebook_url', 'combined_stock_symbols', 'type', 'cb_url', 'short_description', 'primary_role', 'region', 'uuid', 'name', 'homepage_url'}
Columns in Pandas but not PySpark: set()
Columns in PySpark but not Pandas: set()

PySpark Schema:
root
 |-- uuid: string (nullable = true)
 |-- name: string (nullable = true)
 |-- type: string (nullable = true)
 |-- primary_role: string (nullable = true)
 |-- cb_url: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- homepage_url: string (nullable = true)
 |-- logo_url: string (nullable = true)
 |-- facebook_url: string (nullable = true)
 |-- twitter_url: string (nullable = true)
 |-- linkedin_url: string (nullable = true)
 |-- combined_stock_symbols: string (nullable = true)
 |-- city: string (nullable = true)
 |-- region: string (nullable = true)
 |-- country_code: string (nullable

                                                                                

Start Index: 0
Start Index: 5000
Start Index: 10000
... (one line per 5,000-row chunk) ...
Start Index: 1120000
Start Index: 1125000

No mismatches found between the DataFrames.


# Load Data

In [288]:
# Read CSV into PySpark DataFrame
cb_sdf = spark.read.option("header", "true") \
                   .option("delimiter", ",") \
                   .option("quote", "\"") \
                   .option("escape", "\"") \
                   .option("multiLine", "true") \
                   .option("inferSchema", "true") \
                   .csv("crunchbase_odm_orgs (3).csv")

print(f"\033[1mPySpark DataFrame row count: {cb_sdf.count()}\033[0m")

# Print schema
cb_sdf.printSchema()

PySpark DataFrame row count: 1127658
root
 |-- uuid: string (nullable = true)
 |-- name: string (nullable = true)
 |-- type: string (nullable = true)
 |-- primary_role: string (nullable = true)
 |-- cb_url: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- homepage_url: string (nullable = true)
 |-- logo_url: string (nullable = true)
 |-- facebook_url: string (nullable = true)
 |-- twitter_url: string (nullable = true)
 |-- linkedin_url: string (nullable = true)
 |-- combined_stock_symbols: string (nullable = true)
 |-- city: string (nullable = true)
 |-- region: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- short_description: string (nullable = true)



                                                                                

---

# Tasks


1. Find all companies with the name that is only **two words** (e.g. "Goldman Sachs")
    - print the count of such companies **and show()** only the name and location (city, region, country_code) in the resulting Spark DataFrame
2. Find all companies located in the state of California:
    - print the count of such companies **and show()** only the name and location (city, region, country_code) in the resulting Spark DataFrame
3. Add a "Blog" column to the DataFrame with the row entries set to 1 if the "domain" field contains "blogspot.com", and 0 otherwise.
    - show() only the name, location (city, region, country_code) and "Blog" column for companies with the "Blog" field marked as 1
4. Find all companies with names that are **palindromes** (name reads the same way forward and reverse, e.g. madam) using Spark UDF function:
    - print the count and **show()** only the name and location (city, region, country_code) in the resulting Spark DataFrame 

---

# 1. Find all companies with the name that is only two words (e.g. "Goldman Sachs")
- ### print the count of such companies and show() only the name and location (city, region, country_code) in the resulting Spark DataFrame

In [289]:
# Two-word names (split on a single space), restricted to primary_role == 'company'
two_word_name_companies = cb_sdf.filter(
    (size(split(col("name"), " ")) == 2) &
    (col("primary_role") == "company")
).select("name", "city", "region", "country_code")

# Two-word names regardless of primary_role
two_word_name_without_primary = cb_sdf.filter(
    size(split(col("name"), " ")) == 2
).select("name", "city", "region", "country_code")

# Count once and reuse instead of re-running each count for the difference
two_word_company_count = two_word_name_companies.count()
two_word_all_count = two_word_name_without_primary.count()
print(f"\033[1mTwo-word name companies with primary_role='company': {two_word_company_count}\033[0m")
print(f"\033[1mTwo-word name companies: {two_word_all_count}\033[0m")

# Rows dropped by the primary_role filter
print(f"\033[1mDifference: {two_word_all_count - two_word_company_count}\033[0m")

# Show the list in ascending order of name length
two_word_name_companies.orderBy(length(col("name"))).show(truncate=False)

# Show the list in descending order of name length
two_word_name_companies.orderBy(length(col("name")).desc()).show(truncate=False)

Two-word name companies with primary_role='company': 359461
Two-word name companies: 362527
Difference: 3066


                                                                                

+----+----------------------+-------------------+------------+
|name|city                  |region             |country_code|
+----+----------------------+-------------------+------------+
|H W |Rumford               |Rhode Island       |USA         |
|R B |Dernbach              |Rheinland-Pfalz    |DEU         |
|w i |Hallbergmoos          |Bayern             |DEU         |
|C C |Genève                |Geneve             |CHE         |
|2 1 |Copenhagen            |Hovedstaden        |DNK         |
|T M |San Lazzaro Di Savena |Emilia-Romagna     |ITA         |
|D M |Parma                 |Emilia-Romagna     |ITA         |
|R C |Bad Honnef Am Rhein   |Nordrhein-Westfalen|DEU         |
|E P |Ulm                   |Baden-Wurttemberg  |DEU         |
|J R |Del Mar               |California         |USA         |
|Z D |Keszthely             |Veszprem           |HUN         |
|Ye i|NULL                  |NULL               |NULL        |
|n tv|Köln                  |Nordrhein-Westfalen|DEU   

+-------------------------------------------------+----------+----------------------+------------+
|name                                             |city      |region                |country_code|
+-------------------------------------------------+----------+----------------------+------------+
|Drogenberatung Rauchmelder.Beratung.App.Community|NULL      |NULL                  |NULL        |
|Elektro-technische Vertriebsgesellschaft         |NULL      |NULL                  |NULL        |
|Rats-Universitätsbuchhandlung Greifswald         |Greifswald|Mecklenburg-Vorpommern|DEU         |
|Drechsler-und Holzspielzeugmacherbetrieb         |Duisburg  |Nordrhein-Westfalen   |DEU         |
|Phillips,Dorsey,Thomas,Waters,& Braford.         |Henderson |North Carolina        |USA         |
|Schokinag-Schokolade-Industrie Herrmann          |Mannheim  |Baden-Wurttemberg     |DEU         |
|VisoTech Softwareentwicklungsges.m.b.H.          |Vienna    |Wien                  |AUT         |
|Unternehm

                                                                                

## Oh wait! Some company names contain special characters

Whether a name such as "DC &" counts as two words is a judgment call. \
To get a more reasonable list, I used a regex that accepts exactly two tokens made of ASCII letters, digits, or apostrophes, separated by one whitespace character.

**Matches:**
- Xavi's Kitchen
- G4 Media
- GRAVIS Computervertriebsgesellschaft

**Does NOT match:**
- Rats-Universitätsbuchhandlung Greifswald
- Phillips,Dorsey,Thomas,Waters,& Braford.
- Net.curity InformationsTechnologien
- DC &
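To see why the two approaches disagree, the names listed above can be run through plain-Python stand-ins for both filters. This is only a sketch: `two_tokens_by_split` and `two_words_by_regex` are hypothetical helper names, Python's `re` module is assumed to behave like Spark's Java regex for this simple ASCII pattern, and `fullmatch` stands in for the `^...$` anchors.

```python
import re

# Same character classes as the notebook's rlike pattern; fullmatch replaces ^...$
TWO_WORD = re.compile(r"[A-Za-z0-9']+\s[A-Za-z0-9']+")

def two_tokens_by_split(name):
    """Mimic size(split(name, " ")) == 2: split on every single space."""
    return len(name.split(" ")) == 2

def two_words_by_regex(name):
    """True when the whole name is two ASCII letter/digit/apostrophe tokens."""
    return TWO_WORD.fullmatch(name) is not None

# Names taken from the "Matches" / "Does NOT match" lists above
names = [
    "Xavi's Kitchen",
    "G4 Media",
    "GRAVIS Computervertriebsgesellschaft",
    "Rats-Universitätsbuchhandlung Greifswald",
    "Phillips,Dorsey,Thomas,Waters,& Braford.",
    "Net.curity InformationsTechnologien",
    "DC &",
]

for name in names:
    print(f"{name!r}: split={two_tokens_by_split(name)}, regex={two_words_by_regex(name)}")
```

Every name passes the split-based test because each contains exactly one space, while only the first three pass the regex. The regex is stricter by design: it also rejects names with hyphens, periods, commas, or non-ASCII letters such as "ä", which may or may not be what a reader wants.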

In [266]:
# Exactly two tokens of ASCII letters, digits, or apostrophes, separated by one whitespace character
two_word_pattern = r"^[A-Za-z0-9']+\s[A-Za-z0-9']+$"

# Regex-matched two-word names restricted to primary_role == 'company'
two_word_name_companies_with_regex = cb_sdf.where(
    (col("name").rlike(two_word_pattern)) &
    (col("primary_role") == "company")
).select("name", "city", "region", "country_code")

# Regex-matched two-word names regardless of primary_role
two_word_name_companies_with_regex_no_primary_role = cb_sdf.where(
    col("name").rlike(two_word_pattern)
).select("name", "city", "region", "country_code")

# Count once and reuse instead of re-running each count for the difference
regex_company_count = two_word_name_companies_with_regex.count()
regex_all_count = two_word_name_companies_with_regex_no_primary_role.count()
print(f"\033[1mTwo-word name companies with primary_role='company': {regex_company_count}\033[0m")
print(f"\033[1mTwo-word name companies: {regex_all_count}\033[0m")

# Rows dropped by the primary_role filter
print(f"\033[1mDifference: {regex_all_count - regex_company_count}\033[0m")

# Show the list in ascending order of name length
two_word_name_companies_with_regex.orderBy(length(col("name"))).show(truncate=False)

# Show the list in descending order of name length
two_word_name_companies_with_regex.orderBy(length(col("name")).desc()).show(truncate=False)

Two-word name companies with primary_role='company': 335999
Two-word name companies: 338926
Difference: 2927


                                                                                

+----+---------------------+-------------------+------------+
|name|city                 |region             |country_code|
+----+---------------------+-------------------+------------+
|T M |San Lazzaro Di Savena|Emilia-Romagna     |ITA         |
|H W |Rumford              |Rhode Island       |USA         |
|R B |Dernbach             |Rheinland-Pfalz    |DEU         |
|w i |Hallbergmoos         |Bayern             |DEU         |
|C C |Genève               |Geneve             |CHE         |
|2 1 |Copenhagen           |Hovedstaden        |DNK         |
|D M |Parma                |Emilia-Romagna     |ITA         |
|R C |Bad Honnef Am Rhein  |Nordrhein-Westfalen|DEU         |
|E P |Ulm                  |Baden-Wurttemberg  |DEU         |
|J R |Del Mar              |California         |USA         |
|Z D |Keszthely            |Veszprem           |HUN         |
|n tv|Köln                 |Nordrhein-Westfalen|DEU         |
|C 21|Altrincham           |Cheshire           |GBR         |
|IT S|Fr

+--------------------------------------+---------------+----------------------+------------+
|name                                  |city           |region                |country_code|
+--------------------------------------+---------------+----------------------+------------+
|KompetensUtvecklingsInstitutet Sverige|Gothenburg     |Vastra Gotaland       |SWE         |
|Rakennuttajatoimisto Valvontakonsultit|Helsinki       |Southern Finland      |FIN         |
|PUBLICITAMEDIA communicatieregisseurs |Ermelo         |Gelderland            |NLD         |
|GRAVIS Computervertriebsgesellschaft  |Berlin         |Berlin                |DEU         |
|Osterreichische Hoteliervereinigung   |NULL           |NULL                  |NULL        |
|Regionalverband FrankfurtRheinMain    |Frankfurt      |Hessen                |DEU         |
|Intellipharmaceutics International    |Toronto        |Ontario               |CAN         |
|NeuroScientific Biopharmaceuticals    |Nedlands       |Western Austra

                                                                                

---

# 2. Find all companies located in the state of California:
- ### print the count of such companies and show() only the name and location (city, region, country_code) in the resulting Spark DataFrame


In [269]:
# California rows restricted to primary_role == 'company'
california_companies = cb_sdf.filter(
    (col("region") == "California") &
    (col("primary_role") == "company")
).select("name", "city", "region", "country_code")

# California rows regardless of primary_role
california_companies_without_primary = cb_sdf.filter(
    col("region") == "California"
).select("name", "city", "region", "country_code")

# Count once and reuse instead of re-running each count for the difference
california_company_count = california_companies.count()
california_all_count = california_companies_without_primary.count()
print(f"\033[1mCalifornia companies with primary_role='company': {california_company_count}\033[0m")
print(f"\033[1mCalifornia companies: {california_all_count}\033[0m")

# Rows dropped by the primary_role filter
print(f"\033[1mDifference: {california_all_count - california_company_count}\033[0m")

california_companies.show(truncate=False)

California companies with primary_role='company': 94402
California companies: 94871
Difference: 469
+---------------------+--------------+----------+------------+
|name                 |city          |region    |country_code|
+---------------------+--------------+----------+------------+
|Zoho                 |Pleasanton    |California|USA         |
|Facebook             |Menlo Park    |California|USA         |
|Omnidrive            |Palo Alto     |California|USA         |
|Geni                 |West Hollywood|California|USA         |
|Flektor              |Culver City   |California|USA         |
|Fox Interactive Media|Beverly Hills |California|USA         |
|Twitter              |San Francisco |California|USA         |
|StumbleUpon          |San Francisco |California|USA         |
|Scribd               |San Francisco |California|USA         |
|Slacker              |San Diego     |California|USA         |
|Lala                 |Palo Alto     |California|USA         |
|Helio                |Los Angeles   |California|USA         |
|eBay                 |San Jo

                                                                                

---

# 3. Add a "Blog" column to the DataFrame with the row entries set to 1 if the "domain" field contains "blogspot.com", and 0 otherwise.
- ### show() only the name, location (city, region, country_code) and "Blog" column for companies with the "Blog" field marked as 1

In [275]:
# Add a "Blog" column: 1 if the domain contains "blogspot.com", 0 otherwise (NULL domains also get 0)
cb_with_blog = cb_sdf.withColumn("Blog", when(col("domain").contains("blogspot.com"), 1).otherwise(0))

# Keep only the rows flagged as blogs
blogs = cb_with_blog.filter(
    col("Blog") == 1
).select("name", "city", "region", "country_code", "Blog")

# Same filter, restricted to primary_role == 'company'
blogs_with_primary = cb_with_blog.filter(
    (col("Blog") == 1) &
    (col("primary_role") == "company")
).select("name", "city", "region", "country_code", "Blog")

# Count once and reuse instead of re-running each count for the difference
blog_count = blogs.count()
blog_company_count = blogs_with_primary.count()
print(f"\033[1mList with Blog = 1: {blog_count}\033[0m")
print(f"\033[1mList with Blog = 1 with primary_role='company': {blog_company_count}\033[0m")

# Rows dropped by the primary_role filter
print(f"\033[1mDifference: {blog_count - blog_company_count}\033[0m")

blogs.show(truncate=False)

List with Blog = 1: 394
List with Blog = 1 with primary_role='company': 392
Difference: 2
+--------------------------+-------------+------------+------------+----+
|name                      |city         |region      |country_code|Blog|
+--------------------------+-------------+------------+------------+----+
|Sad Urdu Poetry           |San Antonio  |Texas       |USA         |1   |
|The Tech-Freak            |Sheffield    |Sheffield   |GBR         |1   |
|Ma.Gnolia                 |San Francisco|California  |USA         |1   |
|Dynasty Online            |NULL         |NULL        |NULL        |1   |
|Hire-seo                  |NULL         |NULL        |NULL        |1   |
|YelloYello                |Rijswijk     |Zuid-Holland|NLD         |1   |
|Youtubehiphop             |São Paulo    |Sao Paulo   |BRA         |1   |
|Payday advances           |NULL         |NULL        |NULL        |1   |
|Blog Traffic Exchange     |Menlo Park   |California  |USA         |1   |
|Sirius Forex Trading Group|NULL         |NULL        |NULL        |1   |
|Utilsforge   

                                                                                

## What if the domain spells "blogspot.com" with upper-case letters?

In [279]:
# Lower-case the domain first, then check whether it contains "blogspot.com"
cb_with_blog_lower = cb_sdf.withColumn("Blog", when(lower(col("domain")).contains("blogspot.com"), 1).otherwise(0))

# Keep only the rows flagged as blogs
blogs_lower_case = cb_with_blog_lower.filter(
    col("Blog") == 1
).select("name", "city", "region", "country_code", "Blog")

# Same filter, restricted to primary_role == 'company'
blogs_lower_case_company_only = cb_with_blog_lower.filter(
    (col("Blog") == 1) &
    (col("primary_role") == "company")
).select("name", "city", "region", "country_code", "Blog")

# Count once and reuse instead of re-running each count for the difference
blog_lower_count = blogs_lower_case.count()
blog_lower_company_count = blogs_lower_case_company_only.count()
print(f"\033[1mList with Blog = 1: {blog_lower_count}\033[0m")
print(f"\033[1mList with Blog = 1 and primary_role='company': {blog_lower_company_count}\033[0m")

# Rows dropped by the primary_role filter
print(f"\033[1mDifference: {blog_lower_count - blog_lower_company_count}\033[0m")

blogs_lower_case.show(truncate=False)

List with Blog = 1: 394
List with Blog = 1 and primary_role='company': 392
Difference: 2
+--------------------------+-------------+------------+------------+----+
|name                      |city         |region      |country_code|Blog|
+--------------------------+-------------+------------+------------+----+
|Sad Urdu Poetry           |San Antonio  |Texas       |USA         |1   |
|The Tech-Freak            |Sheffield    |Sheffield   |GBR         |1   |
|Ma.Gnolia                 |San Francisco|California  |USA         |1   |
|Dynasty Online            |NULL         |NULL        |NULL        |1   |
|Hire-seo                  |NULL         |NULL        |NULL        |1   |
|YelloYello                |Rijswijk     |Zuid-Holland|NLD         |1   |
|Youtubehiphop             |São Paulo    |Sao Paulo   |BRA         |1   |
|Payday advances           |NULL         |NULL        |NULL        |1   |
|Blog Traffic Exchange     |Menlo Park   |California  |USA         |1   |
|Sirius Forex Trading Group|NULL         |NULL        |NULL        |1   |
|Utilsforge   

                                                                                

## Lower-casing the domain makes no difference: the counts stay at 394 and 392
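The flag logic of both variants can be sketched in plain Python, which makes the handling of missing domains and letter case easy to check by hand. `blog_flag` is a hypothetical helper and the sample domains are made up for illustration; they are not taken from the dataset.

```python
def blog_flag(domain, ignore_case=True):
    """Return 1 if the domain contains 'blogspot.com', else 0.

    A missing domain (None) yields 0, mirroring when(...).otherwise(0),
    where a NULL condition falls through to the otherwise branch.
    """
    if domain is None:
        return 0
    text = domain.lower() if ignore_case else domain
    return 1 if "blogspot.com" in text else 0

# Hypothetical sample values, not rows from the Crunchbase file
samples = ["example.blogspot.com", "Example.BlogSpot.com", "example.com", None]

for domain in samples:
    print(domain, blog_flag(domain, ignore_case=False), blog_flag(domain))
```

Only a mixed-case value such as the second sample would be flagged differently by the two variants. Since both counts are identical here, the dataset apparently contains no such domain.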

---

# 4. Find all companies with names that are palindromes (name reads the same way forward and reverse, e.g. madam) using Spark UDF function:
- ### print the count and show() only the name and location (city, region, country_code) in the resulting Spark DataFrame

In [208]:
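def is_palindrome(name):
    # Case-insensitive check that ignores whitespace; NULL or empty names are not palindromes
    if name:
        cleaned_name = "".join(name.lower().split())
        return cleaned_name == cleaned_name[::-1]
    return False

In [209]:
from pyspark.sql.types import BooleanType

# Wrap the Python function as a Spark UDF that returns a boolean column
palindrome_udf = udf(is_palindrome, BooleanType())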

In [281]:
# Palindrome names restricted to primary_role == 'company'
palindrome_name_companies_primary = cb_sdf.filter(
    (palindrome_udf(col("name"))) &
    (col("primary_role") == "company")
).select("name", "city", "region", "country_code")

# Palindrome names regardless of primary_role
palindrome_name_companies = cb_sdf.filter(
    palindrome_udf(col("name"))
).select("name", "city", "region", "country_code")

# Count once and reuse instead of re-running the UDF for the difference
palindrome_company_count = palindrome_name_companies_primary.count()
palindrome_all_count = palindrome_name_companies.count()
print(f"\033[1mPalindrome name companies with primary_role='company': {palindrome_company_count}\033[0m")
print(f"\033[1mPalindrome name companies: {palindrome_all_count}\033[0m")

# Rows dropped by the primary_role filter
print(f"\033[1mDifference: {palindrome_all_count - palindrome_company_count}\033[0m")

# Show the list in ascending order of name length
palindrome_name_companies.orderBy(length(col("name"))).show()

# Show the list in descending order of name length
palindrome_name_companies.orderBy(length(col("name")).desc()).show(truncate=False)

Palindrome name companies with primary_role='company': 1130
Palindrome name companies: 1139
Difference: 9


                                                                                

+----+-------------+-------------+------------+
|name|         city|       region|country_code|
+----+-------------+-------------+------------+
|  55|        Paris|Ile-de-France|         FRA|
|  EE|     Hatfield|     Hertford|         GBR|
|  99|    São Paulo|    Sao Paulo|         BRA|
|  NN| Johnson City|    Tennessee|         USA|
|  __|         NULL|         NULL|        NULL|
|  MM| Wakamatsucho|        Tokyo|         JPN|
|  KK|     Hangzhou|     Zhejiang|         CHN|
|  DD|      Fairfax|     Virginia|         USA|
|  Gg|     Shanghai|     Shanghai|         CHN|
|  MM|        Miami|      Florida|         USA|
|  FF|      Glasgow| Glasgow City|         GBR|
| B+B|         NULL|         NULL|        NULL|
| SRS|     Montvale|   New Jersey|         USA|
| CBC|       Ottawa|      Ontario|         CAN|
| OQO|San Francisco|   California|         USA|
| WOW|      Chicago|     Illinois|         USA|
| e4e|  Santa Clara|   California|         USA|
| AVA|       Makati|       Manila|      

+-----------------+--------------+--------------------------+------------+
|name             |city          |region                    |country_code|
+-----------------+--------------+--------------------------+------------+
|Never Odd or Even|NULL          |NULL                      |NULL        |
|Radley Yeldar    |London        |England                   |GBR         |
|Oolaboobaloo     |Solliès-toucas|Provence-Alpes-Cote d'Azur|FRA         |
|Level Level      |Rotterdam     |Zuid-Holland              |NLD         |
|igrenEnergi      |Bangalore     |Karnataka                 |IND         |
|Azzip Pizza      |NULL          |NULL                      |NULL        |
|xxxxxxxxxxx      |Burlington    |Massachusetts             |USA         |
|elbaC Cable      |Bueil         |Haute-Normandie           |FRA         |
|AIDEM MEDIA      |Manchester    |Manchester                |GBR         |
|Ossorosso        |Cassano       |Campania                  |ITA         |
|Seremeres        |London

                                                                                

# What if the palindrome check is case-sensitive?

In [282]:
def is_palindrome_case_sensitive(name):
    # Ignores whitespace but keeps letter case, so "Gg" no longer counts as a palindrome
    if name:
        cleaned_name = "".join(name.split())
        return cleaned_name == cleaned_name[::-1]
    return False

In [283]:
palindrome_case_sensitive = udf(is_palindrome_case_sensitive, BooleanType())

In [284]:
# Case-sensitive palindrome names restricted to primary_role == 'company'
palindrome_name_companies_primary = cb_sdf.filter(
    (palindrome_case_sensitive(col("name"))) &
    (col("primary_role") == "company")
).select("name", "city", "region", "country_code")

# Case-sensitive palindrome names regardless of primary_role
palindrome_name_companies = cb_sdf.filter(
    palindrome_case_sensitive(col("name"))
).select("name", "city", "region", "country_code")

# Count once and reuse instead of re-running the UDF for the difference
palindrome_company_count = palindrome_name_companies_primary.count()
palindrome_all_count = palindrome_name_companies.count()
print(f"\033[1mPalindrome name companies with primary_role='company': {palindrome_company_count}\033[0m")
print(f"\033[1mPalindrome name companies: {palindrome_all_count}\033[0m")

# Rows dropped by the primary_role filter
print(f"\033[1mDifference: {palindrome_all_count - palindrome_company_count}\033[0m")

# Show the list in ascending order of name length
palindrome_name_companies.orderBy(length(col("name"))).show()

# Show the list in descending order of name length
palindrome_name_companies.orderBy(length(col("name")).desc()).show(truncate=False)

Palindrome name companies with primary_role='company': 801
Palindrome name companies: 810
Difference: 9


                                                                                

+----+-------------+----------------+------------+
|name|         city|          region|country_code|
+----+-------------+----------------+------------+
|  55|        Paris|   Ile-de-France|         FRA|
|  EE|     Hatfield|        Hertford|         GBR|
|  99|    São Paulo|       Sao Paulo|         BRA|
|  NN| Johnson City|       Tennessee|         USA|
|  __|         NULL|            NULL|        NULL|
|  MM| Wakamatsucho|           Tokyo|         JPN|
|  KK|     Hangzhou|        Zhejiang|         CHN|
|  DD|      Fairfax|        Virginia|         USA|
|  MM|        Miami|         Florida|         USA|
|  FF|      Glasgow|    Glasgow City|         GBR|
| MHM|       Vienna|        Virginia|         USA|
| SIS|        Milan|       Lombardia|         ITA|
| i4i|      Toronto|         Ontario|         CAN|
| DRD|   Chesterton|         Indiana|         USA|
| mmm|    Vancouver|British Columbia|         CAN|
| OKO|      Hayling|       Hampshire|         GBR|
| uhu|         NULL|           

+-----------+-------------+---------------+------------+
|name       |city         |region         |country_code|
+-----------+-------------+---------------+------------+
|igrenEnergi|Bangalore    |Karnataka      |IND         |
|xxxxxxxxxxx|Burlington   |Massachusetts  |USA         |
|elbaC Cable|Bueil        |Haute-Normandie|FRA         |
|AIDEM MEDIA|Manchester   |Manchester     |GBR         |
|erreqerre  |Barcelona    |Catalonia      |ESP         |
|ADOOGOODA  |San Francisco|California     |USA         |
|TOBOROBOT  |NULL         |NULL           |NULL        |
|TIER REIT  |Dallas       |Texas          |USA         |
|semilimes  |Zofingen     |Aargau         |CHE         |
|metsystem  |Secunderabad |Andhra Pradesh |IND         |
|Man88naM   |NULL         |NULL           |NULL        |
|itiliti    |Minneapolis  |Minnesota      |USA         |
|oolaloo    |NULL         |NULL           |NULL        |
|RNDMDNR    |NULL         |NULL           |NULL        |
|ecoloce    |San Francisco|Cali

                                                                                

# What if the check considers only letters and digits?

In [290]:
def is_palindrome_ignore_symbols(name):
    # Case-insensitive check that keeps only letters and digits (drops spaces, punctuation, symbols)
    if name:
        cleaned_name = "".join(c.lower() for c in name if c.isalnum())

        # Optional: skip names with fewer than two alphanumeric characters, such as "__"
        if len(cleaned_name) < 2:
            return False
        return cleaned_name == cleaned_name[::-1]
    return False

In [291]:
palindrome_ignore_symbols = udf(is_palindrome_ignore_symbols, BooleanType())
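The three palindrome checks differ only in how they normalize a name before reversing it, so comparing them on a few names from the outputs in this notebook shows where the counts diverge. The sketch below restates the three functions as plain Python (no Spark needed) so that it runs on its own; the choice of sample names is mine.

```python
def is_palindrome(name):
    # Variant 1: lower-case and drop whitespace
    if name:
        cleaned = "".join(name.lower().split())
        return cleaned == cleaned[::-1]
    return False

def is_palindrome_case_sensitive(name):
    # Variant 2: drop whitespace but keep letter case
    if name:
        cleaned = "".join(name.split())
        return cleaned == cleaned[::-1]
    return False

def is_palindrome_ignore_symbols(name):
    # Variant 3: lower-case and keep only letters and digits
    if name:
        cleaned = "".join(c.lower() for c in name if c.isalnum())
        if len(cleaned) < 2:
            return False
        return cleaned == cleaned[::-1]
    return False

# Names that appear in the result tables of this notebook
samples = ["Never Odd or Even", "Level Level", "Gg", "__", "B+B", "Ini & mini", "AIDEM MEDIA"]

for name in samples:
    print(
        f"{name!r}:",
        is_palindrome(name),
        is_palindrome_case_sensitive(name),
        is_palindrome_ignore_symbols(name),
    )
```

"Gg", "Level Level", and "Never Odd or Even" pass only the case-insensitive variants, "__" is rejected once symbols are stripped because nothing is left to compare, and "Ini & mini" becomes a palindrome only after the "&" is ignored.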

In [292]:
# Alphanumeric-only palindrome names restricted to primary_role == 'company'
palindrome_name_companies_primary = cb_sdf.filter(
    (palindrome_ignore_symbols(col("name"))) &
    (col("primary_role") == "company")
).select("name", "city", "region", "country_code")

# Alphanumeric-only palindrome names regardless of primary_role
palindrome_name_companies = cb_sdf.filter(
    palindrome_ignore_symbols(col("name"))
).select("name", "city", "region", "country_code")

# Count once and reuse instead of re-running the UDF for the difference
palindrome_company_count = palindrome_name_companies_primary.count()
palindrome_all_count = palindrome_name_companies.count()
print(f"\033[1mPalindrome name companies with primary_role='company': {palindrome_company_count}\033[0m")
print(f"\033[1mPalindrome name companies: {palindrome_all_count}\033[0m")

# Rows dropped by the primary_role filter
print(f"\033[1mDifference: {palindrome_all_count - palindrome_company_count}\033[0m")

# Show the list in ascending order of name length
palindrome_name_companies.orderBy(length(col("name"))).show()

# Show the list in descending order of name length
palindrome_name_companies.orderBy(length(col("name")).desc()).show(truncate=False)

Palindrome name companies with primary_role='company': 1177
Palindrome name companies: 1186
Difference: 9


                                                                                

+----+-------------+-------------+------------+
|name|         city|       region|country_code|
+----+-------------+-------------+------------+
|  55|        Paris|Ile-de-France|         FRA|
|  EE|     Hatfield|     Hertford|         GBR|
|  99|    São Paulo|    Sao Paulo|         BRA|
|  NN| Johnson City|    Tennessee|         USA|
|  MM| Wakamatsucho|        Tokyo|         JPN|
|  KK|     Hangzhou|     Zhejiang|         CHN|
|  DD|      Fairfax|     Virginia|         USA|
|  Gg|     Shanghai|     Shanghai|         CHN|
|  MM|        Miami|      Florida|         USA|
|  FF|      Glasgow| Glasgow City|         GBR|
| B+B|         NULL|         NULL|        NULL|
| SRS|     Montvale|   New Jersey|         USA|
| CBC|       Ottawa|      Ontario|         CAN|
| OQO|San Francisco|   California|         USA|
| WOW|      Chicago|     Illinois|         USA|
| e4e|  Santa Clara|   California|         USA|
| AVA|       Makati|       Manila|         PHL|
| ivi|       Moscow|  Moscow City|      

+-----------------+--------------+--------------------------+------------+
|name             |city          |region                    |country_code|
+-----------------+--------------+--------------------------+------------+
|Never Odd or Even|NULL          |NULL                      |NULL        |
|Radley Yeldar    |London        |England                   |GBR         |
|Oolaboobaloo     |Solliès-toucas|Provence-Alpes-Cote d'Azur|FRA         |
|igrenEnergi      |Bangalore     |Karnataka                 |IND         |
|Azzip Pizza      |NULL          |NULL                      |NULL        |
|xxxxxxxxxxx      |Burlington    |Massachusetts             |USA         |
|elbaC Cable      |Bueil         |Haute-Normandie           |FRA         |
|AIDEM MEDIA      |Manchester    |Manchester                |GBR         |
|Level Level      |Rotterdam     |Zuid-Holland              |NLD         |
|Ini & mini       |De Meern      |Utrecht                   |NLD         |
|Ossorosso        |Cassan

                                                                                

# THANK YOU !!