-- Notepad to myself --

# Working with Columns

Both the *pandas* syntax and the *PySpark* syntax are provided.

### Setup environment

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate() 

In [6]:
from pyspark.sql.functions import to_timestamp, col
df = spark.read.csv('Crimes-2021.csv', header=True, inferSchema=True) \
    .withColumn('Date', to_timestamp(col('Date'),'MM/dd/yyyy hh:mm:ss a'))
df.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Case Number: string (nullable = true)
 |-- Date: timestamp (nullable = true)
 |-- Block: string (nullable = true)
 |-- IUCR: string (nullable = true)
 |-- Primary Type: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Location Description: string (nullable = true)
 |-- Arrest: boolean (nullable = true)
 |-- Domestic: boolean (nullable = true)
 |-- Beat: integer (nullable = true)
 |-- District: integer (nullable = true)
 |-- Ward: integer (nullable = true)
 |-- Community Area: integer (nullable = true)
 |-- FBI Code: string (nullable = true)
 |-- X Coordinate: integer (nullable = true)
 |-- Y Coordinate: integer (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Updated On: string (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- Location: string (nullable = true)



### 1. Access a Column in PySpark

In [7]:
df.columns

['ID',
 'Case Number',
 'Date',
 'Block',
 'IUCR',
 'Primary Type',
 'Description',
 'Location Description',
 'Arrest',
 'Domestic',
 'Beat',
 'District',
 'Ward',
 'Community Area',
 'FBI Code',
 'X Coordinate',
 'Y Coordinate',
 'Year',
 'Updated On',
 'Latitude',
 'Longitude',
 'Location']

In PySpark, a DataFrame's column can be accessed either by attribute (dot notation) or by indexing (square brackets). Dot notation does not always work: it breaks when a column name contains spaces or clashes with a reserved word or an existing attribute of the DataFrame class.

In [8]:
df.Year

Column<'Year'>

In [9]:
df['Community Area']

Column<'Community Area'>

In [10]:
df.select(col('Year')).show(5)

+----+
|Year|
+----+
|2021|
|2021|
|2021|
|2021|
|2021|
+----+
only showing top 5 rows



### 1. Access a Column in pandas

Similar rules apply in pandas.

In [11]:
import pandas as pd

df2 = df.toPandas()
type(df2)

pandas.core.frame.DataFrame

In [12]:
df2.columns

Index(['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type',
       'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat',
       'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate',
       'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude',
       'Location'],
      dtype='object')

In [13]:
df2.Year.head(5)

0    2021
1    2021
2    2021
3    2021
4    2021
Name: Year, dtype: int32

In [14]:
df2['Year'].tail(5)

207874    2021
207875    2021
207876    2021
207877    2021
207878    2021
Name: Year, dtype: int32
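
The dot-notation caveat applies to pandas as well. A minimal sketch, using a small made-up DataFrame (not the crimes data) with a column named `count`, which collides with the `DataFrame.count` method, and a column whose name contains a space:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'count': [3, 5], 'Community Area': [25, 8]})

# Attribute access resolves to the DataFrame.count method, not the column
is_method = callable(toy.count)
print(is_method)

# Square brackets always return the column, whatever its name
print(toy['count'].tolist())
print(toy['Community Area'].tolist())
```

A name with a space, such as `Community Area`, cannot be written with dot notation at all, so square brackets are the safer default.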

### 2. Select multiple columns in PySpark

In both PySpark and pandas, we can select more than one column by passing a list within square brackets. In PySpark, it's more common to call the DataFrame's select() method and list the column names we want.

In [15]:
df.select('Case Number', 'Year').show(3)

+-----------+----+
|Case Number|Year|
+-----------+----+
|   JE202728|2021|
|   JF125633|2021|
|   JE475344|2021|
+-----------+----+
only showing top 3 rows



### 2. Select multiple columns in pandas

In [16]:
df2[['Case Number', 'Year']].head(3)

  Case Number  Year
0    JE202728  2021
1    JF125633  2021
2    JE475344  2021
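
In pandas, the type of the result depends on what goes inside the square brackets. A minimal sketch on made-up data (the values are hypothetical, not taken from the crimes file):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'Case Number': ['A1', 'B2'],
    'Year': [2021, 2021],
    'Arrest': [True, False],
})

one_col = toy['Year']                # a single name returns a Series
sub = toy[['Case Number', 'Year']]   # a list of names returns a DataFrame
also_df = toy[['Year']]              # a one-element list still returns a DataFrame

print(type(one_col).__name__, type(sub).__name__, type(also_df).__name__)
```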


### 3. Add a Column in PySpark

Suppose we want to add a new column to our DataFrame whose values are twice those of an existing column. In PySpark, we can use the withColumn() function. In pandas, we assign to the new column name in square brackets.

In [17]:
df = df.withColumn('DoubleID', 2*df['ID'])

In [18]:
df.select('ID', 'DoubleID').tail(3)

[Row(ID=12846092, DoubleID=25692184),
 Row(ID=12841050, DoubleID=25682100),
 Row(ID=12839669, DoubleID=25679338)]

### 3. Add a Column in pandas

In [19]:
df2['DoubleID'] = df2['ID'].multiply(2)

In [20]:
df2[['ID', 'DoubleID']].tail(3)

              ID  DoubleID
207876  12846092  25692184
207877  12841050  25682100
207878  12839669  25679338
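
Note that the bracket assignment in In [19] modifies `df2` in place, whereas PySpark's `withColumn()` returns a new DataFrame. If the non-mutating behaviour is preferred in pandas, `DataFrame.assign()` is one option. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'ID': [10, 20, 30]})

# assign() returns a new DataFrame and leaves the original untouched
doubled = toy.assign(DoubleID=toy['ID'] * 2)

print(list(toy.columns))
print(doubled['DoubleID'].tolist())
```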


### 4. Rename a Column in PySpark

In PySpark, we can use the withColumnRenamed() function, passing the current column name as the first argument and the new column name as the second. Renaming a column returns a new DataFrame. If the current column name doesn't exist, no operation is performed. In pandas, we can use the rename() function, specifying the columns to rename as a dictionary that maps old names to new names.

In [21]:
df = df.withColumnRenamed('DoubleID', 'RenamedColumn')

In [22]:
df.columns

['ID',
 'Case Number',
 'Date',
 'Block',
 'IUCR',
 'Primary Type',
 'Description',
 'Location Description',
 'Arrest',
 'Domestic',
 'Beat',
 'District',
 'Ward',
 'Community Area',
 'FBI Code',
 'X Coordinate',
 'Y Coordinate',
 'Year',
 'Updated On',
 'Latitude',
 'Longitude',
 'Location',
 'RenamedColumn']

### 4. Rename a Column in pandas

In [23]:
df2 = df2.rename(columns={'DoubleID':'RenamedColumn'})

In [24]:
df2.columns

Index(['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type',
       'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat',
       'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate',
       'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude',
       'Location', 'RenamedColumn'],
      dtype='object')
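
Like `withColumnRenamed()`, pandas `rename()` by default skips names it cannot find; passing `errors='raise'` turns a missing name into a `KeyError`. A minimal sketch on made-up data (`NoSuchColumn` is a deliberately nonexistent name):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'ID': [1, 2], 'DoubleID': [2, 4]})

# A key that matches no column is ignored by default (errors='ignore')
same = toy.rename(columns={'NoSuchColumn': 'Whatever'})
print(list(same.columns))

# With errors='raise', the missing key raises a KeyError instead
try:
    toy.rename(columns={'NoSuchColumn': 'Whatever'}, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)
```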

### 5. Remove Columns in PySpark

Removing (dropping) a column has similar syntax in pandas and PySpark. In PySpark, the drop() function returns a new DataFrame without the specified column. If the column to be dropped doesn't exist, no operation is performed.

In [25]:
df = df.drop('RenamedColumn')
df.columns

['ID',
 'Case Number',
 'Date',
 'Block',
 'IUCR',
 'Primary Type',
 'Description',
 'Location Description',
 'Arrest',
 'Domestic',
 'Beat',
 'District',
 'Ward',
 'Community Area',
 'FBI Code',
 'X Coordinate',
 'Y Coordinate',
 'Year',
 'Updated On',
 'Latitude',
 'Longitude',
 'Location']

In [26]:
df.drop('ID', 'Case Number', 'Date') # multiple drop

DataFrame[Block: string, IUCR: string, Primary Type: string, Description: string, Location Description: string, Arrest: boolean, Domestic: boolean, Beat: int, District: int, Ward: int, Community Area: int, FBI Code: string, X Coordinate: int, Y Coordinate: int, Year: int, Updated On: string, Latitude: double, Longitude: double, Location: string]

### 5. Remove Columns in pandas

In [27]:
df2.drop(columns='RenamedColumn', inplace=True)
df2.columns

Index(['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type',
       'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat',
       'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate',
       'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude',
       'Location'],
      dtype='object')
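
One difference worth keeping in mind: unlike PySpark's `drop()`, pandas `drop()` raises a `KeyError` by default when the column does not exist. Passing `errors='ignore'` gives the PySpark-like no-op. A minimal sketch on made-up data (`NoSuchColumn` is a deliberately nonexistent name):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'ID': [1, 2], 'Year': [2021, 2021]})

# Default behaviour: dropping a missing column raises a KeyError
try:
    toy.drop(columns='NoSuchColumn')
    raised = False
except KeyError:
    raised = True
print(raised)

# errors='ignore' silently skips the missing column
unchanged = toy.drop(columns='NoSuchColumn', errors='ignore')
print(list(unchanged.columns))

# Several columns can be dropped at once by passing a list
trimmed = toy.drop(columns=['ID', 'Year'])
print(list(trimmed.columns))
```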

### 6. Add a column with name 'One', with entries all 1s in PySpark

Sometimes we want to add a column that holds a constant value, which we can do using literals. Here we add a column named 'One' whose entries are all 1.

In [28]:
from pyspark.sql.functions import lit

df.withColumn('One', lit(1)).select('One').show(5)

+---+
|One|
+---+
|  1|
|  1|
|  1|
|  1|
|  1|
+---+
only showing top 5 rows



### 6. Add a column with name 'One', with entries all 1s in pandas

In [32]:
df2['One'] = 1
df2['One'].head(5)

0    1
1    1
2    1
3    1
4    1
Name: One, dtype: int64