<a href="https://colab.research.google.com/github/dareoyeleke/python_scripting/blob/main/with_open_regex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **File Handling & Regular Expressions**

This notebook introduces essential text-processing skills in Python using file I/O and regex.

I show how to:

- Read and write files using `with open()`  
- Search and extract text using regular expressions  
- Use patterns, character classes, and quantifiers  
- Clean and analyze text data effectively  

These techniques are foundational for any work involving logs, text files, or unstructured data.
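
As a minimal sketch of the first bullet, the cell below writes a small text file and reads it back with `with open()`. The file name `demo_notes.txt` and the sample text are made up for illustration, and the file is written to a temporary directory rather than to any path used later in this notebook.

```python
import os
import tempfile

# Hypothetical sample text; not part of the notebook's dataset.
sample_text = "Data Analyst,United States,118640\n"

# Work in a throwaway directory so nothing on disk is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_notes.txt")

    # "w" creates (or truncates) the file; the with block closes it automatically.
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(sample_text)

    # "r" opens the file for reading; .read() returns the whole file as one string.
    with open(demo_path, "r", encoding="utf-8") as f:
        round_trip = f.read()

# True when the text read back matches the text written.
print(round_trip == sample_text)
```

Because the file is opened inside a `with` block, it is closed even if an error occurs partway through, which is why no explicit `f.close()` call is needed.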


In [None]:
# Mount Google Drive so its files are accessible from this notebook.
# force_remount=True re-mounts cleanly if the drive is already mounted.
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


In [None]:
%%bash
# Create the target directory if it doesn't exist.
# Note: despite its .txt suffix, /tmp/sql_p2_csv.txt is a directory here.
mkdir -p /tmp/sql_p2_csv.txt

# Copy the file from Google Drive to the local directory
cp /content/drive/MyDrive/sql_p2_csv.txt /tmp/sql_p2_csv.txt/

In [None]:
'''
Open the CSV file with `with open()` in read mode ("r") and load its contents
as a single string with the file method .read(). The string method .split(",")
can turn that string into a mutable list, using a comma as the delimiter.
re.findall() from the re module then extracts text that matches
regular-expression patterns.
'''
import re

with open("/tmp/sql_p2_csv.txt/sql_p2_csv.txt", "r") as file:
    sql_project2 = file.read()
print(sql_project2)

# Find every literal occurrence of "Data Analyst" in the file contents
print(re.findall("Data Analyst", sql_project2))

# Find all salary values: \d{5,6} matches a run of 5 or 6 digits
print('The salaries for Data Analyst jobs in 2023 are', re.findall(r"\d{5,6}", sql_project2), 'in ($)(USD)')

# Find each salary together with the date the job was posted (YYYY-MM-DD)
print('The salaries with the date the jobs were posted are', re.findall(r'\d{5,6},\d{4}-\d{2}-\d{2}', sql_project2))

# Find salary and posted date only where the job location is United States
print('The salaries ($USD) for US jobs with the posted dates are', re.findall(r'United States,\d{5,6},\d{4}-\d{2}-\d{2}', sql_project2))



index,skills,type,company_name,job_title_short,job_no_degree_mention,job_location,salary_year_avg,job_posted_date
0,sql,programming,Invenergy,Data Analyst,true,United States,118640,2023-10-29
1,excel,analyst_tools,Invenergy,Data Analyst,true,United States,118640,2023-10-29
2,power bi,analyst_tools,Invenergy,Data Analyst,true,United States,118640,2023-10-29
3,sql,programming,"Udacity, Inc.",Data Analyst,true,United States,100500,2023-07-25
4,python,programming,"Udacity, Inc.",Data Analyst,true,United States,100500,2023-07-25
5,pandas,libraries,"Udacity, Inc.",Data Analyst,true,United States,100500,2023-07-25
6,numpy,libraries,"Udacity, Inc.",Data Analyst,true,United States,100500,2023-07-25
7,slack,sync,"Udacity, Inc.",Data Analyst,true,United States,100500,2023-07-25
8,zoom,sync,"Udacity, Inc.",Data Analyst,true,United States,100500,2023-07-25
9,sql,programming,American National,Data Analyst,true,United States,59500,2023-12-23
10,tableau,analyst_tools,American National,Data Analyst,tru
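
The patterns above return each match as one string. As a hedged sketch of how quantifiers and capture groups can work together, the cell below pulls the salary and the posted date out as separate values. The `sample` string copies three rows from the printed file contents; the names `row_pattern`, `pairs`, and `salaries` are illustrative and do not appear elsewhere in this notebook.

```python
import re

# Three rows copied from the printed file contents above.
sample = (
    "0,sql,programming,Invenergy,Data Analyst,true,United States,118640,2023-10-29\n"
    '3,sql,programming,"Udacity, Inc.",Data Analyst,true,United States,100500,2023-07-25\n'
    "9,sql,programming,American National,Data Analyst,true,United States,59500,2023-12-23\n"
)

# (\d{5,6})            -> group 1: five or six digits (the salary)
# (\d{4}-\d{2}-\d{2})  -> group 2: a date shaped like YYYY-MM-DD
row_pattern = re.compile(r"United States,(\d{5,6}),(\d{4}-\d{2}-\d{2})")

# With two groups, findall returns a list of (salary, date) tuples.
pairs = row_pattern.findall(sample)

# Convert salaries from text to integers so they can be compared or averaged.
salaries = [int(salary) for salary, _ in pairs]

print(pairs)
print(max(salaries))
```

Parentheses change what `re.findall` returns: without groups it yields whole matches, with one group it yields that group's text, and with several groups it yields tuples.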

In [None]:
import re
high_paid_skills = ['index', 'skills', 'type', 'company_name', 'job_title_short', 'job_no_degree_mention', 'job_location', 'salary_year_avg', 'job_posted_date\n0', 'sql', 'programming', 'Invenergy', 'Data Analyst', 'true', 'United States', '118640', '2023-10-29\n1', 'excel', 'analyst_tools', 'Invenergy', 'Data Analyst', 'true', 'United States', '118640', '2023-10-29\n2', 'power bi', 'analyst_tools', 'Invenergy', 'Data Analyst', 'true', 'United States', '118640', '2023-10-29\n3', 'sql', 'programming', '"Udacity', ' Inc."', 'Data Analyst', 'true', 'United States', '100500', '2023-07-25\n4', 'python', 'programming', '"Udacity', ' Inc."', 'Data Analyst', 'true', 'United States', '100500', '2023-07-25\n5', 'pandas', 'libraries', '"Udacity', ' Inc."', 'Data Analyst', 'true', 'United States', '100500', '2023-07-25\n6', 'numpy', 'libraries', '"Udacity', ' Inc."', 'Data Analyst', 'true', 'United States', '100500', '2023-07-25\n7', 'slack', 'sync', '"Udacity', ' Inc."', 'Data Analyst', 'true', 'United States', '100500', '2023-07-25\n8', 'zoom', 'sync', '"Udacity', ' Inc."', 'Data Analyst', 'true', 'United States', '100500', '2023-07-25\n9', 'sql', 'programming', 'American National', 'Data Analyst', 'true', 'United States', '59500', '2023-12-23\n10', 'tableau', 'analyst_tools', 'American National', 'Data Analyst', 'true', 'United States', '59500', '2023-12-23']
print(high_paid_skills)

# Join the list back into one comma-separated string so re.findall() can search it
high_paid_skills = ','.join(high_paid_skills)
print('The salaries for Data Analyst jobs in 2023 are', re.findall(r"\d{5,6}", high_paid_skills), 'in ($)(USD)')
print('The dates the jobs were posted are', re.findall(r'\d{4}-\d{2}-\d{2}', high_paid_skills))
print('The salaries for US based jobs with the posted dates are', re.findall(r"(United States,\d{5,6},\d{4}-\d{2}-\d{2})", high_paid_skills), 'in ($)(USD)')

['index', 'skills', 'type', 'company_name', 'job_title_short', 'job_no_degree_mention', 'job_location', 'salary_year_avg', 'job_posted_date\n0', 'sql', 'programming', 'Invenergy', 'Data Analyst', 'true', 'United States', '118640', '2023-10-29\n1', 'excel', 'analyst_tools', 'Invenergy', 'Data Analyst', 'true', 'United States', '118640', '2023-10-29\n2', 'power bi', 'analyst_tools', 'Invenergy', 'Data Analyst', 'true', 'United States', '118640', '2023-10-29\n3', 'sql', 'programming', '"Udacity', ' Inc."', 'Data Analyst', 'true', 'United States', '100500', '2023-07-25\n4', 'python', 'programming', '"Udacity', ' Inc."', 'Data Analyst', 'true', 'United States', '100500', '2023-07-25\n5', 'pandas', 'libraries', '"Udacity', ' Inc."', 'Data Analyst', 'true', 'United States', '100500', '2023-07-25\n6', 'numpy', 'libraries', '"Udacity', ' Inc."', 'Data Analyst', 'true', 'United States', '100500', '2023-07-25\n7', 'slack', 'sync', '"Udacity', ' Inc."', 'Data Analyst', 'true', 'United States', '10
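
Splitting on every comma has two side effects visible in the list above: the quoted company name `"Udacity, Inc."` becomes two items, and each row's last field is fused with the next row's index (for example `'2023-10-29\n1'`). One possible alternative, sketched here on the assumption that the file is standard CSV, is the standard-library `csv` module. The `sample` string copies the header and two rows from the data; `rows` and `naive_fields` are illustrative names.

```python
import csv
import io

# Header plus two rows copied from the data shown earlier.
sample = (
    "index,skills,type,company_name,job_title_short,job_no_degree_mention,"
    "job_location,salary_year_avg,job_posted_date\n"
    '3,sql,programming,"Udacity, Inc.",Data Analyst,true,United States,100500,2023-07-25\n'
    "9,sql,programming,American National,Data Analyst,true,United States,59500,2023-12-23\n"
)

# io.StringIO lets csv read from a string as if it were an open file.
# DictReader uses the header row as keys and respects quoted commas.
rows = list(csv.DictReader(io.StringIO(sample)))

# For comparison: a plain split on the first data row.
naive_fields = sample.splitlines()[1].split(",")

print(len(naive_fields))                # one more than the 9 header columns
print(rows[0]["company_name"])          # quoted field kept intact
print(int(rows[1]["salary_year_avg"]))  # salary as an integer
```

With a real file, the same reader can be given the file object from `with open(path, newline="") as f:` in place of the `io.StringIO` wrapper.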