# R vs Python for Data Analytics: A Comparative Walkthrough
### Author: Ebuwa Evbuoma-Fike, Senior Data Scientist in Healthcare
### Last Edited: 2/3/25
### Workshop Date: 2/8/25

## Material for R vs Python Workshop

### Optional Pre-Workshop Setup: Python Environment for Data Analysis

This is an optional pre-workshop setup guide to help you set up and get familiar with the Python environment for the workshop.

**Installation**

1. Install Anaconda/Miniconda:

Conda is a popular, free, open-source package and environment management system for setting up and managing Python environments.

Download and install Anaconda or Miniconda from the official website: https://www.anaconda.com/. Choose the version appropriate for your operating system (Windows, macOS, Linux).
Note: Miniconda is a smaller, more lightweight version of Anaconda. It ships only conda, Python, and a few core packages, so any other library has to be installed separately.

2. Install VS Code:

VS Code is a popular, free IDE. You are welcome to use another IDE of your choice; I will be using VS Code during the workshop.

Download and install VS Code from the official website: https://code.visualstudio.com/.

**Setup**

3. Create a working directory by either:

- Saving the downloaded workshop repository from GitHub as a folder, in a location you will remember. Instructions in 4a.

- Cloning the GitHub repository (recommended). Instructions in 4b.

4a. Launch VS Code and open the named folder

Click "Open Folder" (image below). Choose the folder designated for this workshop in Step #3.

![AddFolders](adding_folders_vs_code.png)


4b. Launch VS Code and clone the repository

Click "Clone Repository" and choose a storage location (image below). Use the GitHub repo URL.

![Clone](cloning_repositories_vs_code.png)


5. Set up your Python environment

In your terminal in VS Code (see https://code.visualstudio.com/docs/terminal/basics), run the following (sans ``):

`conda env create -f rpy2_env.yml`

The first line of the yml file sets the new environment's name. The rest of the file pins the libraries, versions, and dependencies so that they are set up precisely. Sometimes one library needs an upgraded or downgraded version of another to run; this environment has them playing nicely.

Next, to activate the conda environment, run the following (sans ``):

`conda activate rpy2_env`

More conda documentation and commands: https://docs.conda.io/projects/conda/en/latest/commands/index.html

6. Check your setup

You should see `(rpy2_env)` at the far left of the command line prompt in your terminal window.

To check your working directory, run the following (sans ``):

`pwd`

The printed path should match that of your folder from Step #3.

To verify that your environment matches the provided environment, run the following:

`conda list`

The list of libraries should match the content of the yml file (double-click on `rpy2_env.yml` and it will pop up in a new tab, adjacent to this notebook.)
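
The same checks can also be made from Python. The sketch below is a minimal example, assuming conda's usual behavior of setting the `CONDA_DEFAULT_ENV` variable when an environment is activated. The helper name `describe_session` is hypothetical and not part of the workshop material.

```python
import os
import sys


def describe_session(expected_env="rpy2_env"):
    """Report the active conda environment, interpreter, and working directory."""
    # conda normally sets CONDA_DEFAULT_ENV on activation;
    # the variable is absent when Python runs outside conda.
    active_env = os.environ.get("CONDA_DEFAULT_ENV")
    return {
        "active_env": active_env,
        "env_matches": active_env == expected_env,
        "python_executable": sys.executable,
        "working_directory": os.getcwd(),
    }


session = describe_session()
for key, value in session.items():
    print(f"{key}: {value}")
```

If `env_matches` is `False`, the notebook kernel is probably not running inside the workshop environment.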

Read the `citibike_trips_schema.xlsx` file to understand data definitions used for this workshop.

This concludes the setup tutorial.

### R vs Python for Data Analytics: A Comparative Walkthrough Technical Workshop

There are essentially two popular ways to work with R and Python in a local Python environment:
- Work with both languages interchangeably using the rpy2 library.
- Work distinctly in each language, using the r-essentials and rpy2 libraries.

### A. Working with R and Python, flexibly

#### 1. Import necessary modules:

In [None]:
import rpy2.robjects as robjects 
from rpy2.robjects import pandas2ri 

#### 2. Convert Python objects to R objects:

- Convert Python lists to R vectors:

In [None]:
python_list = [1, 2, 3, 4, 5] 
r_vector = robjects.IntVector(python_list) 

- Convert Python dictionaries to R lists:

In [None]:
python_dict = {'a': 1, 'b': 2, 'c': 3}
r_list = robjects.ListVector(python_dict)

- Convert `pandas` DataFrames to R data frames:

Pandas is a popular library for advanced data manipulation in Python.

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': np.array([1, 2, 3]), 'col2': np.array(['a', 'b', 'c'])}) 
# Note: `py2ri` is the rpy2 2.x name; rpy2 3.x renamed it to `py2rpy`
r_df = pandas2ri.py2ri(df)

#### 3. Call R functions:

- Use the `robjects.r()` function to execute R code:

In [None]:
r_result = robjects.r('mean(c(1, 2, 3))') 
print(r_result[0])  # Access the result 

- Call R functions directly:

In [None]:
r_mean = robjects.r['mean'] 
r_result = r_mean(r_vector) 
print(r_result[0]) 

#### 4. Convert R objects back to Python:

- Convert R vectors to Python lists:

In [None]:
python_list = list(r_vector) 

#### Example

This example demonstrates how to convert a `pandas` DataFrame to an R data frame and then apply the `summary()` function from R.

Key Points:

- R Integration: The `rpy2` library allows you to seamlessly integrate R code and Python code within the same environment.

- Data Conversion: The `pandas2ri` module simplifies data conversion between pandas and R.

- Flexibility: You can call R functions directly or execute R code within Python using the `rpy2.robjects.r()` function.

In [None]:
import rpy2.robjects as robjects 
from rpy2.robjects import pandas2ri 

# Create a pandas DataFrame
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# Convert to R DataFrame (`py2ri` is the rpy2 2.x name; `py2rpy` in rpy2 3.x)
r_df = pandas2ri.py2ri(df)

# Apply R's summary function
r_summary = robjects.r['summary'](r_df) 

# Print the summary of the R dataframe
print(r_summary) 

Key problems with option A:

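- Planned deprecation and renaming of key `pandas2ri` functionality (for example, `py2ri` from rpy2 2.x became `py2rpy` in rpy2 3.x). This happens with package development, so code that runs today may need updating later.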

### B. Working with R and Python, distinctly

In [None]:
%load_ext rpy2.ipython

Did that run sans errors? Great!

Each time you want to use a Python variable in R (for example, the pandas DataFrame `df`), you would “send” it to R using the following cell magic:

`%%R -i df`

The alternative (more user-friendly, and it circumvents common errors) is to pass your R code as a triple-quoted string, in the format below:
`robjects.r('''
    # your R code here. ''')`

The triple-quoted string approach is used in this workshop.

### Data Import

**1. In Python, load the dataset (the `.csv` file) and inspect dataset attributes.**
- Refer to the `citibike_trips_schema.xlsx` file for a data dictionary.
- `DataFrame.head(n)`: Method which returns the first n rows of the dataset.

In [None]:
# Load dataset
import pandas as pd
citibike_rides = pd.read_csv("n_1000_citibike_trips.csv")

In [None]:
# Inspect first n rows of the citibike_rides dataframe
citibike_rides.head(n=5)

In [None]:
# View information on datatypes, summary statistics etc
citibike_rides.info()
citibike_rides.describe()

**2. In R, load dataset (csv file) and inspect.**
- `head(x, n)`: Function which returns the first n rows of the dataset.

In [None]:
robjects.r('''
    # Import the data
    citibike_rides <- read.csv("n_1000_citibike_trips.csv") 

''') 

If you wanted to *truly* amalgamate Python and R in one chunk...

- Here we retrieve the R object holding the imported dataset, `citibike_rides`, then use Python's `print()` function to output the entire dataframe.

In [None]:
# Get the R object (citibike_rides)
r_citibike_rides = robjects.r['citibike_rides'] 

In [None]:
# You can now work with the R DataFrame directly
print(r_citibike_rides) 


- Next, we print a summary of the dataset. The advantage of this approach? It is legible and well-formatted.

In [None]:
# Perform R operations on it:
r_summary = robjects.r['summary'](r_citibike_rides) 
print(r_summary)

I *do not* recommend amalgamating R and Python. Instead, use your triple-quoted strings! See below...

In [None]:
robjects.r('''
    print(summary(citibike_rides))
''')

In [None]:
robjects.r('''
    # Inspect the first n rows of the dataset
    head(citibike_rides, n= 5)
''') 

### Data Manipulation

**1. In Python, select specific columns and filter the dataset.**

We are looking for all ride encounters with riders born after the year 1980.

In [None]:
# Subset columns as in selected_columns below
selected_columns = ["tripduration", "starttime", "stoptime", "start_station_name", 
                    "end_station_name", "bikeid", "usertype", "birth_year", "gender"]
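# .copy() makes the subset independent, so adding columns later avoids a SettingWithCopyWarning
citibike_rides_subset = citibike_rides[selected_columns].copy()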
citibike_rides_subset.head(n=10)  # Display the first n rows of the subset dataframe

In [None]:
# Filter the citibike_rides_subset dataframe to rides 
# with riders born after the year 1980
citibike_rides_subset[citibike_rides_subset["birth_year"] > 1980]
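
The filter above relies on a boolean mask. Here is a minimal sketch of how that works, on a tiny made-up DataFrame (the `rides` name and its values are illustrative, not rows from the workshop dataset). It also shows two equivalent spellings, `.loc` and `.query`.

```python
import pandas as pd

# Hypothetical toy data; not rows from the workshop CSV
rides = pd.DataFrame({
    "bikeid": [101, 102, 103, 104],
    "birth_year": [1975, 1980, 1985, 1992],
})

# Step 1: the comparison yields one True/False value per row
mask = rides["birth_year"] > 1980

# Step 2: indexing with the mask keeps only the rows marked True
born_after_1980 = rides[mask]

# Equivalent spellings of the same filter
same_with_loc = rides.loc[rides["birth_year"] > 1980]
same_with_query = rides.query("birth_year > 1980")

print(born_after_1980)
```

Note that the comparison is strict, so a rider born in 1980 is excluded.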

**2. In R, select specific columns and filter the dataset.**

We are looking for all ride encounters with riders born after the year 1980.
The code below uses base R indexing, which is the equivalent of `dplyr::select` and `dplyr::filter`.

After the workshop, you can use:

- filter() from dplyr
- select() from dplyr

In [None]:
robjects.r('''
    # Select specific columns from the dataset
    selected_columns <- c("tripduration", "starttime", "stoptime", 
                         "start_station_name", "end_station_name", 
                         "bikeid", "usertype", "birth_year", "gender")
    citibike_rides_selected <- citibike_rides[, selected_columns] 
''') 


In [None]:
robjects.r('''
    # Inspect the data
    head(citibike_rides_selected, n= 5)
''') 

In [None]:
robjects.r('''
    # Filter the dataset
    citibike_rides_selected[citibike_rides_selected$birth_year > 1980, ]

''') 

### Data Transformation

**1. In Python, transform data types (object to category) and create a new column, "age", calculated by subtracting birth_year from the current year at the time of this workshop.**

In [None]:
# Check the column data types using the `.dtypes` accessor
citibike_rides_subset.dtypes

In [None]:
# The `.columns` accessor outputs the column names in 
# our citibike_rides_subset dataframe.
citibike_rides_subset.columns

In [None]:
# Check existing data types using the `.dtypes` accessor
citibike_rides_subset.dtypes

In [None]:
# Change the datatype of the selected columns to category
import pandas as pd
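# astype returns a new DataFrame, so assign the result to keep the change
citibike_rides_subset = citibike_rides_subset.astype({"start_station_name": "category", 
                            "end_station_name": "category", 
                            "usertype": "category", 
                            "gender": "category"})
citibike_rides_subset.dtypes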

In [None]:
import pandas as pd
import datetime as dt
# Get the current year
current_year = dt.datetime.now().year 

# Calculate age as current_year - birth_year
citibike_rides_subset["age"] = current_year - citibike_rides_subset["birth_year"] 

# View content of birth_year and age for the first 5 rows
print(citibike_rides_subset[["birth_year", "age"]].head()) 

In [None]:
# Check data type transformations using `.dtypes` accessor
citibike_rides_subset.dtypes
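
A note on reading this check: `astype` returns a new DataFrame rather than modifying the original, so `.dtypes` only reflects a conversion whose result was assigned to a variable. The sketch below illustrates this on made-up data (the `toy` DataFrame is hypothetical, not part of the workshop dataset).

```python
import pandas as pd

# Hypothetical toy data; not rows from the workshop CSV
toy = pd.DataFrame({
    "usertype": ["Subscriber", "Customer", "Subscriber"],
    "tripduration": [300, 1200, 450],
})

# Without assignment, the converted copy is discarded
toy.astype({"usertype": "category"})
still_unconverted = str(toy["usertype"].dtype) != "category"

# Assigning the result keeps the conversion
toy = toy.astype({"usertype": "category"})
print(still_unconverted, toy["usertype"].dtype)
```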

In [None]:
# View the first n rows of your dataframe
citibike_rides_subset.head(n=5)

**2. In R, convert the selected columns to character and create a new column, "age", calculated by subtracting birth_year from the current year at the time of this workshop.**

In [None]:
robjects.r('''
    # Define the columns to convert to character
    cols_to_convert <- c("start_station_name", "end_station_name", "usertype", "gender")

    # Convert the specified columns to character
    for (col in cols_to_convert) {
      citibike_rides_selected[, col] <- as.character(citibike_rides_selected[, col])
    }

    # Return the first 5 rows of the modified data frame
    head(citibike_rides_selected, n= 5)

''') 

In [None]:
robjects.r('''
    # Get current year
    current_year <- as.numeric(format(Sys.Date(), "%Y")) 

    # Calculate age
    citibike_rides_selected$age <- current_year - citibike_rides_selected$birth_year 

    # Return the modified data frame with the new 'age' column
    head(citibike_rides_selected, n= 5)

''') 

### Summary Statistics

**1. In Python, find the mean age of commuters who set off from the five (5) most common start stations.**

Assume a one-to-many relationship between `bikeid` and the other columns.

In [None]:
import pandas as pd

# Find the top 5 (by number of rides) most common start stations
citibike_rides_subset["start_station_name"].value_counts().head(n=5)



In [None]:
import pandas as pd

# 1. Find the top 5 most common start stations, and extract their names into a list
top_5_stations = citibike_rides_subset["start_station_name"].value_counts().head(n=5).index.tolist()

# 2. Group the DataFrame by 'start_station_name'
grouped_citibike_rides_subset = citibike_rides_subset.groupby("start_station_name")

# 3. Calculate mean age for each station in the top 5
mean_age_by_station = {station: grouped_citibike_rides_subset.get_group(station)["age"].mean() for station in top_5_stations} 

# Output the generated dictionary
print(mean_age_by_station)

In [None]:
# Advanced - convert results from a dictionary to a dataframe
# Useful for additional analyses
import pandas as pd
# 4. Create a DataFrame with station names and their mean ages
results_mean_age_top_5_station = pd.DataFrame({'Station': list(mean_age_by_station.keys()), 
                          'Mean Age': list(mean_age_by_station.values())})

# 5. Print the result
print(results_mean_age_top_5_station)
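
The same result can be reached without building a dictionary first. This is a hedged alternative rather than the workshop's method: filter the rides with `isin`, then let `groupby(...).mean()` do the aggregation. The `toy` data below is made up, and it uses the top two stations instead of five to keep the output short.

```python
import pandas as pd

# Hypothetical toy data; not rows from the workshop CSV
toy = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "B", "C"],
    "age": [30, 40, 50, 20, 30, 60],
})

# Names of the two busiest start stations (the workshop uses five)
top_stations = toy["start_station_name"].value_counts().head(2).index

# Keep rides from those stations, then average age per station
mean_age = (
    toy[toy["start_station_name"].isin(top_stations)]
    .groupby("start_station_name")["age"]
    .mean()
    .reset_index(name="Mean Age")
)
print(mean_age)
```

The result is already a DataFrame, so no separate conversion step is needed.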

**2. In R, find the mean age of commuters who set off from the five (5) most common start stations.**

Assume a one-to-many relationship between `bikeid` and the other columns.

Steps:
- Find Top 5 Stations:
    - Create a frequency table of start stations.
    - Sort the table in descending order of frequency.
    - Select the top 5 most frequent stations and extract their names from the table.

- Calculate Mean Age for Each Station:
    - The code iterates through each of the top 5 stations and selects the rides that started at the current station.
    - It calculates the mean age of riders at that station, ignoring missing values (`na.rm = TRUE`).
    - The calculated mean age is appended to the `mean_ages` vector.
- Create and Print Results DataFrame:
    - A new data frame `results_mean_age_top_5_station` is created with two columns: "Station" and "Mean_Age". The results are printed to the console.

In [None]:
robjects.r('''
# Find the top 5 most frequent start stations
top_5_stations <- names(head(sort(table(citibike_rides_selected$start_station_name), decreasing = TRUE), 5))

# Create an empty vector to store mean ages
mean_ages <- c()

# Calculate mean age for each of the top 5 stations
for (station in top_5_stations) {
  station_rides <- subset(citibike_rides_selected, start_station_name == station)
  mean_age <- mean(station_rides$age, na.rm = TRUE) 
  mean_ages <- c(mean_ages, mean_age)
}

# Create a data frame with station names and their mean ages, then print it
results_mean_age_top_5_station <- data.frame(Station = top_5_stations, Mean_Age = mean_ages)
print(results_mean_age_top_5_station)

''') 

### Data Visualization

**1. In Python, build a series of data visualizations, from simple to advanced.**

`matplotlib` and `seaborn` (which is built on top of matplotlib) are two widely popular data visualization libraries in Python.

A. Simple Histogram: Distribution of Trip Durations

We build a simple histogram to visualize the distribution of trip durations in the dataset. You can adjust the `bins` parameter to change the number of bins in the histogram.

In [None]:
import matplotlib.pyplot as plt

# Plot a histogram of trip durations
plt.figure(figsize=(10, 6))
plt.hist(citibike_rides_subset['tripduration'], bins=50, color='skyblue', edgecolor='black')
plt.xlabel('Trip Duration (seconds)')
plt.ylabel('Number of Trips')
plt.title('Distribution of Trip Durations')
plt.show()
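
To see what the `bins` parameter changes, it can help to look at the counts behind the bars. The sketch below uses `numpy.histogram`, which computes equal-width bin counts of the kind `plt.hist` draws by default. The `durations` values are made up for illustration.

```python
import numpy as np

# Hypothetical trip durations in seconds; not values from the workshop CSV
durations = np.array([120, 300, 310, 450, 600, 610, 900, 1500])

# Fewer bins give a coarser picture, more bins a finer one
coarse_counts, coarse_edges = np.histogram(durations, bins=2)
fine_counts, fine_edges = np.histogram(durations, bins=4)

print("2 bins:", coarse_counts.tolist(), "edges:", coarse_edges.tolist())
print("4 bins:", fine_counts.tolist(), "edges:", fine_edges.tolist())
```

Every trip lands in exactly one bin, so the counts always add up to the number of trips regardless of how many bins you choose.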

B. Box Plot: Trip Duration by User Type

We use the `seaborn` library to create a box plot, which visually compares the distribution of trip durations for different user types (e.g., Subscriber, Customer). Box plots are useful for identifying potential outliers and comparing the spread of data across different groups.

In [None]:
import seaborn as sns

# Create a box plot of trip duration by user type
sns.boxplot(x='usertype', y='tripduration', data=citibike_rides_subset)
plt.xlabel('User Type')
plt.ylabel('Trip Duration (seconds)')
plt.title('Trip Duration by User Type')
plt.show()
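
The points a box plot draws as outliers come from a simple rule. With seaborn's default whisker length (`whis=1.5`), values more than 1.5 times the interquartile range beyond the quartiles are flagged. The sketch below applies that rule by hand to made-up durations; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical trip durations in seconds; not values from the workshop CSV
durations = pd.Series([200, 300, 400, 500, 600, 700, 800, 5000])

# Quartiles and interquartile range (IQR)
q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1

# Fences used by the common 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the potential outliers
outliers = durations[(durations < lower_fence) | (durations > upper_fence)]
print(outliers.tolist())
```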

C. Simple Bar Chart: Distribution of User Types

This visualization will show the distribution of user types (Subscriber vs. Customer) in the dataset. Subscribers have an annual pass.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Count occurrences of each user type
user_type_counts = citibike_rides_subset['usertype'].value_counts()

# Create a bar chart
plt.figure(figsize=(8, 5))
user_type_counts.plot(kind='bar', color='skyblue')
plt.title('Distribution of User Types')
plt.xlabel('User Type')
plt.ylabel('Number of Rides')
plt.xticks(rotation=0)  # keep the tick labels horizontal
plt.show()

D. Scatter Plot with Age and Trip Duration

This visualization will explore the relationship between user age and trip duration.

In [None]:
import seaborn as sns

# Create a scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='age', y='tripduration', data=citibike_rides_subset, alpha=0.5)
plt.title('Trip Duration vs. User Age')
plt.xlabel('User Age (years)')
plt.ylabel('Trip Duration (seconds)')
plt.show()
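
A scatter plot shows the relationship visually; a correlation coefficient puts a number on it. As a minimal sketch, pandas' `Series.corr` computes the Pearson correlation by default. The `toy` values below are made up, so the result says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data; not rows from the workshop CSV
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "tripduration": [900, 700, 800, 500, 600],
})

# Pearson correlation: -1 is a perfect negative linear relationship,
# 0 is no linear relationship, +1 is a perfect positive one
r = toy["age"].corr(toy["tripduration"])
print(round(r, 2))
```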

**2. In R, build a series of data visualizations, from simple to advanced.**

A. Simple Histogram: Distribution of Trip Durations

We build a simple histogram to visualize the distribution of trip durations in the dataset.

In [None]:
import rpy2.robjects as robjects

robjects.r('''
# Create a histogram of trip durations
hist(citibike_rides_selected$tripduration, 
     breaks = 50, 
     col = "skyblue", 
     border = "black", 
     xlab = "Trip Duration (seconds)", 
     ylab = "Number of Trips", 
     main = "Distribution of Trip Durations")

''') 

B. Simple Bar Chart: Distribution of User Types

This visualization will show the distribution of user types (Subscriber vs. Customer) in the dataset. Subscribers have an annual pass.

In [None]:
import rpy2.robjects as robjects

robjects.r('''
# Count the occurrences of each user type
user_type_counts <- table(citibike_rides_selected$usertype)

# Create the bar plot
barplot(user_type_counts, 
        main = "Distribution of User Types", 
        xlab = "User Type", 
        ylab = "Number of Rides", 
        col = "lightblue") 

''') 

Thanks for your attention and engagement today! You're welcome to connect with me on LinkedIn (please remember to add a message to your connection request.)