<div class="alert alert-block alert-info">
Author:<br>Felix Gonzalez, P.E. <br> Adjunct Instructor, <br> Division of Professional Studies <br> Computer Science and Electrical Engineering <br> University of Maryland Baltimore County <br> fgonzale@umbc.edu
</div>

This notebook provides an overview of basic concepts for working with various types of data sources and files in the Python programming language and Jupyter Notebooks. These data sources include files such as TXT, JSON, and CSV. The notebook also discusses functions, libraries, and methods for working with multiple data source files. As a data scientist you will have to work with data spread across multiple files and multiple file types. In many cases you will need to consolidate, merge, or concatenate them to make the dataset more useful.

# Table of Contents
[Python Libraries in this Notebook](#Python-Libraries-in-this-Notebook)

[OS Library](#OS-Library)

- [Working Directory Path Environment](#Working-Directory-Path-Environment)

- [Working in the Directory](#Working-in-the-Directory)

[Pandas: Working with Multiple Files](#Pandas:-Working-with-Multiple-Files)

- [Reading Multiple CSV Files and Pandas: Electricity Use Data Example](#Reading-Multiple-CSV-Files-and-Pandas:-Electricity-Use-Data-Example)

[Reading Data Line By Line](#Reading-Data-Line-By-Line)

[Working with PDFs](#Working-with-PDFs)

[Python Built-In Functions for Loading Data (OPTIONAL)](#Python-Built-In-Functions-for-Loading-Data-(OPTIONAL))

- [Open Function](#Open-Function)

- [With Statement](#With-Statement)

- [Example with Numerical Data in a TXT File](#Example-with-Numerical-Data-in-a-TXT-File)

[JSON Library: JSON Files (OPTIONAL)](#JSON-Library:-JSON-Files-(OPTIONAL))

[Working with CSV Files (CSV Library)](#Working-with-CSV-Files-(CSV-Library))

[Reading Multiple TXT Files (OPTIONAL)](#Reading-Multiple-TXT-Files-(OPTIONAL))

# Python Libraries in this Notebook
[Return to Table of Contents](#Table-of-Contents)

In [1]:
import os
import pathlib
import glob
import requests
from urllib.request import urlopen
import csv
import re
import fileinput
import shutil
import json

import pandas as pd

# OS Library
[Return to Table of Contents](#Table-of-Contents)

The Operating System (OS) module provides a portable way of using OS-dependent functionality. Several built-in functions and standard-library modules are useful alongside it:
- open() function: Read or write a file
- os.path module: Read and manipulate paths
- fileinput module: Read all the lines in all the files given on the command line
- tempfile module: Create temporary files and directories
- shutil module: High-level file and directory handling

Other libraries, such as the glob library, also make it easy to work with files.

Documentation References:
- OS Library: https://docs.python.org/3/library/os.html
- Glob Library: https://docs.python.org/3/library/glob.html

There are a few things to remember that vary from system to system.

In [2]:
os.linesep # Line separator; its value depends on the operating system.

'\r\n'

From the OS Library documentation:

os.linesep: "The string used to separate (or, rather, terminate) lines on the current platform. This may be a single character, such as '\n' for POSIX, or multiple characters, for example, '\r\n' for Windows. Do not use os.linesep as a line terminator when writing files opened in text mode (the default); use a single '\n' instead, on all platforms."
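The advice in the quote above can be checked with a small experiment. The following is a minimal sketch, assuming a throwaway temporary folder and a hypothetical file name (`linesep_demo.txt`); it writes lines with a single `'\n'` in text mode and then looks at what was stored on disk.

```python
import os
import tempfile

# A temporary folder is used so nothing is left behind (the file name is hypothetical).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "linesep_demo.txt")

    # Write with a plain '\n'; text mode translates it to the platform line ending.
    with open(demo_path, "w") as f:
        f.write("first line\n")
        f.write("second line\n")

    # Reading in text mode translates the platform line ending back to '\n'.
    with open(demo_path, "r") as f:
        lines_read = f.readlines()

    # Binary mode shows the bytes stored on disk (b'\r\n' on Windows, b'\n' on POSIX).
    with open(demo_path, "rb") as f:
        raw_bytes = f.read()

print(lines_read)
print(raw_bytes)
```

The text-mode result is the same on every platform, while the raw bytes differ, which is why a single `'\n'` is the portable choice when writing.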

Also recall that Windows paths use "\" as the separator while Unix/macOS paths use "/", so depending on your operating system you may get different results in the cell above. The Unix-style path works on both operating systems within the Python environment and should be __preferred__. Recall that \ is also the escape character in Python strings, so it is __NOT__ the preferred approach.

References:
- https://stackoverflow.com/questions/1589930/so-what-is-the-right-direction-of-the-paths-slash-or-under-windows
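To illustrate the point about path separators, here is a minimal sketch using the pure path classes from pathlib, which only manipulate strings and never touch the disk. The relative path is taken from this notebook's folder layout; the variable names are illustrative.

```python
from pathlib import PurePosixPath, PureWindowsPath

relative_path = "input_data/text_files/example3.txt"  # forward slashes only

# Both path flavors split a forward-slash path into the same components.
windows_view = PureWindowsPath(relative_path)
posix_view = PurePosixPath(relative_path)
print(windows_view.parts)
print(posix_view.parts)

# as_posix() always renders the path with forward slashes.
print(windows_view.as_posix())

# The backslash is an escape character: '\t' below silently becomes a TAB.
risky = "input_data\text_files"
raw = r"input_data\text_files"  # a raw string keeps the backslash literally
print(len(risky), len(raw))
```

The last two lines show why a backslash inside a normal string literal can corrupt a path without raising any error.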

#### Working Directory Path Environment
[Return to Table of Contents](#Table-of-Contents)

In [3]:
print(os.environ["PATH"])

C:\ProgramData\anaconda3;C:\ProgramData\anaconda3\Library\mingw-w64\bin;C:\ProgramData\anaconda3\Library\usr\bin;C:\ProgramData\anaconda3\Library\bin;C:\ProgramData\anaconda3\Scripts;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;C:\Users\felix\AppData\Local\Microsoft\WindowsApps;C:\Users\felix\AppData\Roaming\Python\Python312\Scripts;


In [4]:
# On Windows the HOME environment variable is usually not set, so this raises a KeyError. See next cell.
print(os.environ["HOME"])

KeyError: 'HOME'

In [6]:
# On Windows, use the following instead of os.environ["HOME"].
home_folder = os.path.expanduser('~')
print(home_folder)

C:\Users\felix
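As a portable alternative, the pathlib library offers `Path.home()`, and `os.environ.get()` avoids the KeyError when a variable is missing. This is a minimal sketch; the variable names are illustrative.

```python
import os
from pathlib import Path

# Two portable ways to find the home folder.
home_from_os = os.path.expanduser("~")
home_from_pathlib = Path.home()
print(home_from_os)
print(home_from_pathlib)

# .get() returns None instead of raising KeyError when the variable is not set.
home_env = os.environ.get("HOME")
print(home_env)
```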


In [7]:
# Current working directory.
os.getcwd()

'C:\\Felix_ASUS_Docs\\1A_Python_Projects\\DATA601_Files\\Lecture09'

#### Working in the Directory
[Return to Table of Contents](#Table-of-Contents)

In [8]:
# List of everything in the present directory (Note: Does not include files in subfolders)
os.listdir(path='.')

['.ipynb_checkpoints',
 '19_Working_wFiles.ipynb',
 '20_Working_with_WebData_and_APIs.ipynb',
 '21a_SQL_DB_Test_File_Creation.ipynb',
 '21b_Relational_Databases_and_SQL.ipynb',
 'database.sqlite',
 'input_data',
 'output_data']

In [9]:
# run the command mkdir (make a directory) in the system shell
os.system('mkdir test_folder')

0

In [10]:
# list everything in the present directory
os.listdir(path='.') 

['.ipynb_checkpoints',
 '19_Working_wFiles.ipynb',
 '20_Working_with_WebData_and_APIs.ipynb',
 '21a_SQL_DB_Test_File_Creation.ipynb',
 '21b_Relational_Databases_and_SQL.ipynb',
 'database.sqlite',
 'input_data',
 'output_data',
 'test_folder']

In [11]:
# now we have a new folder called "test_folder". let's go into that directory
os.chdir('./test_folder/')

In [12]:
os.getcwd()

'C:\\Felix_ASUS_Docs\\1A_Python_Projects\\DATA601_Files\\Lecture09\\test_folder'

In [13]:
# let's go back to previous folder
os.chdir('./..')
os.getcwd()

'C:\\Felix_ASUS_Docs\\1A_Python_Projects\\DATA601_Files\\Lecture09'

In [14]:
os.system('mkdir test_folder2')

0

In [15]:
# let's create a folder inside the test_folder2
#os.system('mkdir test_folder2/test_subfolder1')

# Alternatively to os.system, we can use the os.mkdir.
os.mkdir('test_folder2/test_subfolder1')
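Note that os.mkdir creates only one level and fails if the folder already exists. When the parent folders may be missing, os.makedirs or pathlib's `mkdir(parents=True)` can be used instead. The sketch below runs inside a temporary folder so it does not interfere with the folders created above; `test_folder3` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as base_dir:
    # os.makedirs creates the parent folder and the subfolder in one call.
    nested = os.path.join(base_dir, "test_folder2", "test_subfolder1")
    os.makedirs(nested, exist_ok=True)
    os.makedirs(nested, exist_ok=True)  # no error the second time because of exist_ok
    created = os.path.isdir(nested)

    # Equivalent call with pathlib.
    other = Path(base_dir) / "test_folder3" / "test_subfolder1"
    other.mkdir(parents=True, exist_ok=True)
    created_with_pathlib = other.is_dir()

print(created, created_with_pathlib)
```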

In [16]:
# let's list subdirectories using the pathlib library
p = pathlib.Path('.') # current directory
for x in p.iterdir():
    if x.is_dir():
        print(x)

.ipynb_checkpoints
input_data
output_data
test_folder
test_folder2


In [17]:
# CREATING A FILE AT CURRENT DIRECTORY
f = open('test_file.txt',"w+")

# WRITING IN A FILE
# Use a single '\n' in text mode; Python translates it to the platform line ending.
for x in range(10):
    f.write("This is line %d\n" % (x+1))
    
# CLOSING A FILE
f.close()

In [18]:
# OPENING AN EXISTING FILE AND APPENDING TO IT
g = open("test_file.txt","a")

for x in range(5):
    g.write("Appended line %d\n" % (x+1))

g.close()

In [19]:
# READING A LOCAL FILE LINE BY LINE (More on this in later sections).
# See section "Reading Data Line By Line"
with open('test_file.txt','r') as h:
    for line in h:
        print(line.rstrip('\n'))

This is line 1
This is line 2
This is line 3
This is line 4
This is line 5
This is line 6
This is line 7
This is line 8
This is line 9
This is line 10
Appended line 1
Appended line 2
Appended line 3
Appended line 4
Appended line 5



In [20]:
# let's list subdirectories (if any) using the pathlib library
p = pathlib.Path('.') # current directory
for x in p.iterdir():
    if x.is_dir():
        print(x)

.ipynb_checkpoints
input_data
output_data
test_folder
test_folder2


In [21]:
# Let's remove the two directories that we created.
os.rmdir('test_folder')
os.rmdir('test_folder2') # Will give an error because the folder is not empty.

OSError: [WinError 145] The directory is not empty: 'test_folder2'

In [22]:
# To remove directory and subdirectories use the shutil library
shutil.rmtree('test_folder2')
#shutil.rmtree('../test_folder2')

In [23]:
# Let's list subdirectories (if any) using the pathlib library
p = pathlib.Path('.') # current directory
for x in p.iterdir():
    if x.is_dir():
        print(x)

.ipynb_checkpoints
input_data
output_data


Let's use the glob library to list all the txt files in this folder and its subfolders. The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.

Documentation References:
- https://docs.python.org/3/library/glob.html

In [24]:
glob.glob('**/*.txt', recursive=True)

['test_file.txt',
 'input_data\\anotherfile.txt',
 'input_data\\bread.txt',
 'input_data\\ex1data1.txt',
 'input_data\\text_files\\example3.txt',
 'input_data\\text_files\\FAME.TXT',
 'input_data\\text_files\\Genescan.txt',
 'output_data\\anotherfile.txt',
 'output_data\\json_data.txt',
 'output_data\\textout.txt']
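Because glob results come back in arbitrary order and with OS-specific separators, it can help to sort them and normalize the separators. The following is a minimal sketch using pathlib's `rglob`, run on a small temporary folder tree that mimics part of this notebook's layout (the files created here are stand-ins, not the real data).

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as base_dir:
    base = Path(base_dir)
    # Build a small stand-in folder tree.
    (base / "input_data" / "text_files").mkdir(parents=True)
    (base / "test_file.txt").write_text("a\n")
    (base / "input_data" / "bread.txt").write_text("b\n")
    (base / "input_data" / "text_files" / "example3.txt").write_text("c\n")
    (base / "input_data" / "addresses.csv").write_text("d\n")

    # rglob searches recursively; sorted() gives a reproducible order and
    # as_posix() renders every path with forward slashes on any OS.
    txt_files = sorted(p.relative_to(base).as_posix() for p in base.rglob("*.txt"))

print(txt_files)
```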

Another option to obtain all the files in the directory is to use the os.walk() function.

Documentation References:
- os.walk(): https://docs.python.org/3/library/os.html

In [25]:
all_files_tuple = [x for x in os.walk(top = './')] # Produces a 3-tuple of (dirpath, dirnames, filenames)

all_files_dir = [] # Defines a starting empty list for the all_files_dir.

for i in range(len(all_files_tuple)): # Iterates thru each index or element in all_files_tuple.
    # Iterates thru files in dir and only select files in directory with specified extension (case insensitive).
    txt_files = [filename for filename in os.listdir(all_files_tuple[i][0]) if filename.lower().endswith('.txt'.lower())]
    # Combines both the relative path and filename
    txt_files = [all_files_tuple[i][0] +'/'+ filename  for filename in txt_files]
    # Consolidates all files from the path being iterated into the main list.
    all_files_dir = all_files_dir + txt_files

In [26]:
all_files_tuple

[('./',
  ['.ipynb_checkpoints', 'input_data', 'output_data'],
  ['19_Working_wFiles.ipynb',
   '20_Working_with_WebData_and_APIs.ipynb',
   '21a_SQL_DB_Test_File_Creation.ipynb',
   '21b_Relational_Databases_and_SQL.ipynb',
   'database.sqlite',
   'test_file.txt']),
 ('./.ipynb_checkpoints',
  [],
  ['19_Working_wFiles-checkpoint.ipynb',
   '20_Working_with_WebData_and_APIs-checkpoint.ipynb',
   '21b_Relational_Databases_and_SQL-checkpoint.ipynb']),
 ('./input_data',
  ['airline_data', 'electricity_use_data_cleaned', 'text_files'],
  ['addresses.csv',
   'anotherfile.txt',
   'bread.txt',
   'csv_plain_file.csv',
   'ex1data1.txt',
   'geo_data.json',
   'NRC_ASP_DATA_from_Public_ASP_Dashboard.csv']),
 ('./input_data\\airline_data',
  [],
  ['airlines.csv',
   'airports.csv',
   'flights.csv',
   'ZIPCODE_SQFEET_HEATERTYPE_HOUSETYPE_HOUSEUSAGE-TEMPLATE.csv']),
 ('./input_data\\electricity_use_data_cleaned',
  [],
  ['20737_1006_electric_singlefamily_primary_cleaned.csv',
   '20871_24

In [27]:
all_files_dir # The output here is similar to the output in the glob.glob.

['.//test_file.txt',
 './input_data/anotherfile.txt',
 './input_data/bread.txt',
 './input_data/ex1data1.txt',
 './input_data\\text_files/example3.txt',
 './input_data\\text_files/FAME.TXT',
 './input_data\\text_files/Genescan.txt',
 './output_data/anotherfile.txt',
 './output_data/json_data.txt',
 './output_data/textout.txt']
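The mixed separators and the double slash in the output above come from joining paths with `'/'` by hand. A more compact variant is sketched below: it unpacks the 3-tuple from os.walk directly and uses os.path.join. The helper name `find_files` and the temporary folder tree are illustrative assumptions, not part of the original notebook.

```python
import os
import tempfile

def find_files(top, extension):
    """Return sorted paths under `top` whose name ends with `extension` (case insensitive)."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(top):
        for filename in filenames:
            if filename.lower().endswith(extension.lower()):
                # os.path.join inserts the correct separator for the current OS.
                matches.append(os.path.join(dirpath, filename))
    return sorted(matches)

with tempfile.TemporaryDirectory() as base_dir:
    # Build a small stand-in folder tree.
    os.makedirs(os.path.join(base_dir, "input_data", "text_files"))
    for parts in [("test_file.txt",),
                  ("input_data", "bread.txt"),
                  ("input_data", "text_files", "FAME.TXT"),
                  ("input_data", "addresses.csv")]:
        with open(os.path.join(base_dir, *parts), "w") as f:
            f.write("sample\n")

    # Show the paths relative to the top folder, with forward slashes for display.
    found = [os.path.relpath(p, base_dir).replace(os.sep, "/")
             for p in find_files(base_dir, ".txt")]

print(found)
```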

In [28]:
# Let's extract the name found at position 2 of each path after splitting on '/'.
sub_folders = [] # Defines a starting empty list that will hold the extracted names.

# Iterates thru all file paths to extract the component at all_files_dir[i].split('/')[2].
# Note: os.walk joined the subfolders with '\\' on Windows, so for every path here position 2 is the file name.
for i in range(len(all_files_dir)):
    sub_folders = sub_folders + [all_files_dir[i].split('/')[2]]
sub_folders = list(set(sub_folders)) # Creates a list of unique names by removing duplicates.

In [29]:
sub_folders # Unique file names. The set merged files that share a name (e.g., anotherfile.txt exists in two folders).

['ex1data1.txt',
 'FAME.TXT',
 'test_file.txt',
 'example3.txt',
 'bread.txt',
 'Genescan.txt',
 'json_data.txt',
 'textout.txt',
 'anotherfile.txt']

In [30]:
# Let's remove the test_file.txt that we created earlier and check the txt files in the directory.
os.remove('./test_file.txt')
glob.glob('**/*.txt', recursive=True)

['input_data\\anotherfile.txt',
 'input_data\\bread.txt',
 'input_data\\ex1data1.txt',
 'input_data\\text_files\\example3.txt',
 'input_data\\text_files\\FAME.TXT',
 'input_data\\text_files\\Genescan.txt',
 'output_data\\anotherfile.txt',
 'output_data\\json_data.txt',
 'output_data\\textout.txt']

In [31]:
glob.glob('input_data/*.csv', recursive=True)

['input_data\\addresses.csv',
 'input_data\\csv_plain_file.csv',
 'input_data\\NRC_ASP_DATA_from_Public_ASP_Dashboard.csv']

# Pandas: Working with Multiple Files
[Return to Table of Contents](#Table-of-Contents)

Up to this moment we have used Pandas to load data from a single file and in a few cases join/merge/concatenate with data from another file. As a data scientist you will have to work with data in multiple files and in many cases you will need to consolidate, merge, or concatenate to make the dataset more useful. This section discusses some examples on working with multiple files using OS and Pandas library.

#### Reading Multiple CSV Files and Pandas: Electricity Use Data Example
[Return to Table of Contents](#Table-of-Contents)

When reading and extracting data from multiple files there may be various ways. In this case we will explore extracting data from the electricity use data that we collected at the beginning of the class and explore various challenges that we may encounter when processing the data files. More often than not, when the data was collected manually by various people there may be differences in the files that may provide challenges on how the data is extracted.

The first step is to manually get familiarized with the datasets as well as the columns and features collected. Once the feature names are explored and normalized (i.e., made the same) we can continue to merge the datasets and extract any data that we may need to extract from the filename or the metadata.

In [32]:
# See Documentation for os.walk(): https://docs.python.org/3/library/os.html
# Produces a list of 3-tuples of (dirpath, dirnames, filenames), one per directory visited.
all_files_tuple = [x for x in os.walk(top = './input_data/electricity_use_data_cleaned/')]
all_files_tuple

[('./input_data/electricity_use_data_cleaned/',
  [],
  ['20737_1006_electric_singlefamily_primary_cleaned.csv',
   '20871_2400_GAS_TOWNHOUSE_PRIMARY_cleaned.csv',
   '20871_3200_electric_singlefamily_primary_cleaned.csv',
   '20874_NA_ELECTRIC_Apartiment_PRIMARY_cleaned.csv',
   '20904_1700_ELECTRIC_TOWNHOUSE_PRIMARY_cleaned.csv'])]

In [33]:
# Call the list of the file names.
all_files_tuple[0][2]

['20737_1006_electric_singlefamily_primary_cleaned.csv',
 '20871_2400_GAS_TOWNHOUSE_PRIMARY_cleaned.csv',
 '20871_3200_electric_singlefamily_primary_cleaned.csv',
 '20874_NA_ELECTRIC_Apartiment_PRIMARY_cleaned.csv',
 '20904_1700_ELECTRIC_TOWNHOUSE_PRIMARY_cleaned.csv']

In [34]:
# Goal: obtain a list of the files of the specified type (e.g., csv) with their relative paths to iterate thru later.
all_files_dir = [] # Defines a starting empty list for the all_files_dir.
sub_folders = [] # Defines a starting empty List of subfolder names that will have specified file type (e.g., csv).

for i in range(len(all_files_tuple)): # Iterates thru each index or element in all_files_tuple.
    # Iterates thru files in dir and only select files in directory with specified extension (case insensitive).
    csv_files = [filename for filename in os.listdir(all_files_tuple[i][0]) if filename.lower().endswith('.csv'.lower())]
    # Combines both the relative path and filename
    csv_files = [all_files_tuple[i][0] +'/'+ filename  for filename in csv_files]
    # Consolidates all files from the path being iterated into the main list.
    all_files_dir = all_files_dir + csv_files
    
# Iterates thru all file paths to extract the folder name, located at all_files_dir[i].split('/')[2].
for i in range(len(all_files_dir)):
    sub_folders = sub_folders + [all_files_dir[i].split('/')[2]]   
sub_folders = set(sub_folders) # Creates a set of unique folder names by removing duplicates.
# The subfolders could represent a year, a location, a zipcode, or some other important information.
# Because there are no subfolders in input_data/electricity_use_data_cleaned, the only entry is that folder itself.

# Checks that the final list has no duplicate files.
print(f'List has no duplicate files: {len(all_files_dir) == len(set(all_files_dir))}.')
print(f'There are {len(all_files_dir)} specified file type (e.g., csv).') # Shows how many files are in the final list.
print(f'Unique subfolders with specified file type (e.g., csv): {sub_folders}.') # Show list of sub-folders.
print(f'Sample of three files: {all_files_dir[:3]}.') # Shows only the first 3 files in the list.

List has no duplicate files: True.
There are 5 specified file type (e.g., csv).
Unique subfolders with specified file type (e.g., csv): {'electricity_use_data_cleaned'}.
Sample of three files: ['./input_data/electricity_use_data_cleaned//20737_1006_electric_singlefamily_primary_cleaned.csv', './input_data/electricity_use_data_cleaned//20871_2400_GAS_TOWNHOUSE_PRIMARY_cleaned.csv', './input_data/electricity_use_data_cleaned//20871_3200_electric_singlefamily_primary_cleaned.csv'].


In [35]:
all_files_dir # In this case there are only a handful of files so we can see the full list.

['./input_data/electricity_use_data_cleaned//20737_1006_electric_singlefamily_primary_cleaned.csv',
 './input_data/electricity_use_data_cleaned//20871_2400_GAS_TOWNHOUSE_PRIMARY_cleaned.csv',
 './input_data/electricity_use_data_cleaned//20871_3200_electric_singlefamily_primary_cleaned.csv',
 './input_data/electricity_use_data_cleaned//20874_NA_ELECTRIC_Apartiment_PRIMARY_cleaned.csv',
 './input_data/electricity_use_data_cleaned//20904_1700_ELECTRIC_TOWNHOUSE_PRIMARY_cleaned.csv']

After we know the file paths for the data that we want, we have various options:
1. We can initialize the main dataframe by creating a new blank dataframe with the required columns 
2. Use the columns from the TEMPLATE but we also need to add the columns related to the file names.

However, before attempting this we should manually open a few of the CSV files to see if we can spot potential challenges or problems. The original files submitted by students can be found under the 'Homework/Special_HW_Home_Electricity_Consumption' folder. The files in input_data/electricity_use_data_cleaned have been manually fixed. Issues that we can observe in the data:
- Different date formats.
- Some files used a date range while others estimated the month and year.
- Some others added extra data that seemed to be available.
- Issues with column naming.

Note that the instructions were vague on purpose to demonstrate potential issues that may arise during the data collection stage. Even when instructions are clear you will encounter challenges like this. Coordinating during data collection can improve the quality of the data and the insights obtained when it is analyzed. Issues in data collection can sometimes make the data unusable, so that it has to be discarded.

There are a few things that we can do to solve the issue.
1. If we only have a handful of files, we can make the changes manually (which is the case here). In some cases we may have to make some assumptions to make the data usable.
2. If we have lots of files (e.g., hundreds), we may need to write an if statement that detects the type of issue in each file and then either:
    a. fixes the issue before extracting, or
    b. extracts the data from the appropriate location.
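Option 2 can be sketched for the date-format issue. The example below tries a short list of candidate formats and reports which values could not be parsed. Only the "2023 February" style appears in the cleaned files; the other two formats, the helper name `parse_bill_date`, and the sample values are assumptions for illustration. Month names are matched according to the current locale (English by default).

```python
from datetime import datetime

# Candidate formats, tried in order. "%Y %B" matches values such as "2023 February";
# the other two are hypothetical formats that a submitted file might use.
CANDIDATE_FORMATS = ["%Y %B", "%m/%d/%Y", "%Y-%m-%d"]

def parse_bill_date(text):
    """Return a datetime for the first matching format, or None if none match."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next format
    return None

samples = ["2023 February", "02/15/2023", "2023-02-15", "Feb 2023?"]
parsed = [parse_bill_date(s) for s in samples]

for original, result in zip(samples, parsed):
    print(f"{original!r} -> {result}")
```

Values that come back as None can be collected and reviewed manually, which keeps the automated step from silently guessing.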

Another issue is that, given the current structure of the dataset, there is a high likelihood of duplicates in the data. For example, if we obtain more data, our neighbors' data may be very similar and, in the case of the file naming, exactly the same. This will cause issues because operating systems do not allow two files in the same folder to have the same filename. Adding the address to the filename could have avoided this.

Depending on the issue, addressing it may range from easy to very difficult to impossible. Option 2 above assumes that the issue is consistent across files and that the majority of the data was collected using the correct template or format, so that an if statement can identify the exceptions. Imagine having these issues across hundreds of files that we need to process.

In [36]:
# Let's explore the TEMPLATE dataframe, other files and explore their columns.
# Note that the columns are in the first row.
i = -1 # Index of the file to inspect; try -1, 0, or a positive index.
print(csv_files[i])
pd.read_csv(csv_files[i], header = None).head(5)

./input_data/electricity_use_data_cleaned//20904_1700_ELECTRIC_TOWNHOUSE_PRIMARY_cleaned.csv


Unnamed: 0,0,1,2,3
0,Date,Bill Amount,Days in Billing Cycle,KWH Usage
1,2023 February,280,30,1400
2,2023 January,450,35,2500
3,2022 December,290,30,1700
4,2022 November,270,28,900


In [37]:
# We want to combine all the dataframes with electricity use data, including the data in the filename.
# One approach could be to initialize a main empty dataframe with the right columns where the data gets added.
# Let's initialize an empty main DataFrame with the required columns.
df_electricity_usage = pd.DataFrame(columns = ['Zipcode', 'Home Square Footage', 'Heater Type', 'Home Type', 
                                               'Home Usage', 'Date', 'Bill Amount', 'Days in Billing Cycle', 
                                               'KWH Usage'])
# Note that some of the data will be in the filename while other data will be inside the file.
df_electricity_usage

Unnamed: 0,Zipcode,Home Square Footage,Heater Type,Home Type,Home Usage,Date,Bill Amount,Days in Billing Cycle,KWH Usage


In [38]:
# Let's explore the data for file at index 0.
i = 0
file_name_list = csv_files[i].split("/")[4].split('_')
file_name_list

['20737', '1006', 'electric', 'singlefamily', 'primary', 'cleaned.csv']
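Indexing the path with `split("/")[4]` depends on how deep the file sits in the folder tree. A variant that does not depend on depth is sketched below, using os.path.basename and os.path.splitext. The helper name `parse_filename` and the `FIELD_NAMES` list are illustrative; the field order follows the filename convention shown above.

```python
import os

# Field order as encoded in the filename: ZIPCODE_SQFEET_HEATERTYPE_HOUSETYPE_HOUSEUSAGE
FIELD_NAMES = ["Zipcode", "Home Square Footage", "Heater Type", "Home Type", "Home Usage"]

def parse_filename(path):
    """Return a dict with the metadata encoded in the file name."""
    stem = os.path.splitext(os.path.basename(path))[0]  # drops the folders and '.csv'
    parts = stem.split("_")
    # zip stops at the shorter sequence, so the trailing 'cleaned' part is ignored.
    return dict(zip(FIELD_NAMES, parts))

example_path = ("./input_data/electricity_use_data_cleaned//"
                "20737_1006_electric_singlefamily_primary_cleaned.csv")
metadata = parse_filename(example_path)
print(metadata)
```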

In [39]:
testdf = pd.read_csv(csv_files[i], header = 0)
# We can add a column of the filename information with the insert function.
testdf.insert(0, "Home Usage", file_name_list[4])
testdf

Unnamed: 0,Home Usage,Date,Bill Amount,Days in Billing Cycle,KWH Usage
0,primary,2021 December,$117.03,32,857
1,primary,2022 January,$156.02,27,1114
2,primary,2022 February,$98.69,28,681
3,primary,2022 March,$78.70,30,532
4,primary,2022 April,$67.62,30,429
5,primary,2022 May,$144.69,29,851
6,primary,2022 June,$226.59,32,1436
7,primary,2022 July,$221.60,27,1445
8,primary,2022 August,$235.64,31,1436
9,primary,2022 September,$145.08,30,884


In [40]:
# We need to create a loop that iterates thru all the files.
for i in range(len(csv_files)):
    # Loading csv file as temporary df.
    tempdf = pd.read_csv(csv_files[i], header = 0) # Header located at row index 0.
    # Filename information in a list using underscore as delimiter.
    file_name_list = csv_files[i].split("/")[4].split('_')
    # Inserts the new columns for Zipcode, Home Square Footage, Heater Type, Home Type, and Home Usage.
    tempdf.insert(0, "Home Usage", file_name_list[4])
    tempdf.insert(0, "Home Type", file_name_list[3])
    tempdf.insert(0, "Heater Type", file_name_list[2])
    tempdf.insert(0, "Home Square Footage", file_name_list[1])
    tempdf.insert(0, "Zipcode", file_name_list[0])

    # Before concatenating, check that tempdf has the same columns, in the same order,
    # as the main dataframe (i.e., df_electricity_usage).
    if list(df_electricity_usage.columns) == list(tempdf.columns):
        df_electricity_usage = pd.concat([df_electricity_usage, tempdf], axis=0) # Adds tempdf at bottom of main df.
    else:
        print('PROCESS STOPPED: The following file does not have the same features/columns:')
        print(f'{csv_files[i]}') # Prints the file with the issue.
        break # Stops the for loop at the first file that does not meet the condition.

In [41]:
print(df_electricity_usage.shape)
df_electricity_usage#.head(60)
# Note there are 78 rows but the last row index is 12?
# We should have used reset_index (or ignore_index=True in pd.concat).
# If this goes unnoticed you will potentially have issues later.

(78, 9)


Unnamed: 0,Zipcode,Home Square Footage,Heater Type,Home Type,Home Usage,Date,Bill Amount,Days in Billing Cycle,KWH Usage
0,20737,1006,electric,singlefamily,primary,2021 December,$117.03,32,857
1,20737,1006,electric,singlefamily,primary,2022 January,$156.02,27,1114
2,20737,1006,electric,singlefamily,primary,2022 February,$98.69,28,681
3,20737,1006,electric,singlefamily,primary,2022 March,$78.70,30,532
4,20737,1006,electric,singlefamily,primary,2022 April,$67.62,30,429
...,...,...,...,...,...,...,...,...,...
8,20904,1700,ELECTRIC,TOWNHOUSE,PRIMARY,2022 June,140,30,800
9,20904,1700,ELECTRIC,TOWNHOUSE,PRIMARY,2022 May,150,31,900
10,20904,1700,ELECTRIC,TOWNHOUSE,PRIMARY,2022 April,160,31,1200
11,20904,1700,ELECTRIC,TOWNHOUSE,PRIMARY,2022 March,265,31,2000


In [42]:
# What happens if I use position-based data selection and filtering?
df_electricity_usage.iloc[0, 1] # iloc is position-based, so it selects only the first row even though label 0 is repeated.

'1006'

In [43]:
df_electricity_usage.iloc[0:3, :] # Selects only the first three rows by position (0 to 2).

Unnamed: 0,Zipcode,Home Square Footage,Heater Type,Home Type,Home Usage,Date,Bill Amount,Days in Billing Cycle,KWH Usage
0,20737,1006,electric,singlefamily,primary,2021 December,$117.03,32,857
1,20737,1006,electric,singlefamily,primary,2022 January,$156.02,27,1114
2,20737,1006,electric,singlefamily,primary,2022 February,$98.69,28,681


In [44]:
df_electricity_usage.reset_index(inplace = True, drop =True)
df_electricity_usage.tail()

Unnamed: 0,Zipcode,Home Square Footage,Heater Type,Home Type,Home Usage,Date,Bill Amount,Days in Billing Cycle,KWH Usage
73,20904,1700,ELECTRIC,TOWNHOUSE,PRIMARY,2022 June,140,30,800
74,20904,1700,ELECTRIC,TOWNHOUSE,PRIMARY,2022 May,150,31,900
75,20904,1700,ELECTRIC,TOWNHOUSE,PRIMARY,2022 April,160,31,1200
76,20904,1700,ELECTRIC,TOWNHOUSE,PRIMARY,2022 March,265,31,2000
77,20904,1700,ELECTRIC,TOWNHOUSE,PRIMARY,2022 February,260,30,2000


A few observations:
- Depending on the approach you used and your dataset, it may be a good idea to check for duplicated values. If there are duplicated values, check the source to confirm whether they are true duplicates.
- In the case above the repeated rows seem to be from the template. I can remove the template from the input_data folder and rerun the notebook, which will fix the issue.
- Reset index is an invaluable tool, especially when creating dataframes or copying a subset of a dataframe. There may be reasons to keep the original dataframe index (i.e., when you want to reference back).
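The index issue can be reproduced on a tiny scale. This is a minimal sketch with two made-up dataframes (the column name matches the data above; the frames themselves are illustrative) showing why label-based selection becomes ambiguous after concatenation and how `ignore_index=True` avoids it.

```python
import pandas as pd

first = pd.DataFrame({"KWH Usage": [857, 1114]})
second = pd.DataFrame({"KWH Usage": [1400, 2500]})

# Each dataframe keeps its own 0..n-1 labels, so the labels repeat after stacking.
stacked = pd.concat([first, second], axis=0)
print(stacked.index.tolist())

# Label-based selection now returns one row per source dataframe.
rows_with_label_zero = stacked.loc[0]
print(rows_with_label_zero)

# ignore_index=True builds a fresh 0..n-1 index during the concatenation.
clean = pd.concat([first, second], axis=0, ignore_index=True)
print(clean.index.tolist())
```

Calling `reset_index(drop=True)` afterwards, as done above, gives the same result as passing `ignore_index=True` up front.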

Once the data is combined we can use several of the functions we have learned to explore the data.

In [45]:
df_electricity_usage['Home Square Footage'].value_counts()
# Here we can see a few issues, such as the missing (NA) square footage.

Home Square Footage
2400    27
3200    13
NA      13
1700    13
1006    12
Name: count, dtype: int64

Once the data is combined and cleaned into one dataframe we may want to export the combined dataframe as a CSV file, explore the data, and evaluate if it needs further cleaning before using it for analysis or model development. For example, the last rows in df_electricity_usage were from the template, are not needed, and can be dropped. We also need to explore any issues that the dataset may have, such as null values, duplicates, and other data issues, and create derived features as needed.
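Two of the cleaning steps can be sketched on a small made-up sample: some bill amounts carry a "$" sign while others do not, and the square footage contains the text "NA". The sample values mirror the kinds of entries seen above, but the dataframe itself is illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Home Square Footage": ["1006", "NA", "1700"],
    "Bill Amount": ["$117.03", "450", "$98.69"],
})

# Remove the literal '$' (regex=False treats it as plain text), then convert to numbers.
sample["Bill Amount"] = pd.to_numeric(
    sample["Bill Amount"].str.replace("$", "", regex=False)
)

# errors='coerce' turns values that are not numbers (here 'NA') into NaN.
sample["Home Square Footage"] = pd.to_numeric(
    sample["Home Square Footage"], errors="coerce"
)

print(sample)
print(sample.dtypes)
```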

# Reading Data Line By Line
[Return to Table of Contents](#Table-of-Contents)

Many of the approaches below focus on reading data line by line. Loading and processing data line by line may be beneficial in some cases, for example when a file is too big to load into a computer's memory or when reading real-time data. In some cases this approach may be more efficient (i.e., from a memory or processing perspective), make it easier to locate errors, and allow reading a file in real time. Note that for large files other (better) options may include cloud environments where we may have more computational power at our disposal.
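As a minimal sketch of the idea, the function below computes simple statistics while holding only one line in memory at a time. The helper name `running_line_stats` and the temporary sample file are illustrative assumptions.

```python
import os
import tempfile

def running_line_stats(path):
    """Return (number of lines, length of the longest line) without loading the whole file."""
    line_count = 0
    longest = 0
    with open(path, "r") as f:
        for line in f:  # the file object yields one line at a time
            line_count += 1
            longest = max(longest, len(line.rstrip("\n")))
    return line_count, longest

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file
    with open(sample_path, "w") as f:
        f.write("a\nbbb\ncc\n")
    stats = running_line_stats(sample_path)

print(stats)
```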

# Working with PDFs
[Return to Table of Contents](#Table-of-Contents)

There are various libraries that can read PDFs. The most common are PDFMiner and PyPDF2; however, neither of the two is included in the Anaconda Python Distribution (potentially because they are not as mature as other data science libraries). I have had the best experience converting the PDFs to HTML using Acrobat Pro (which requires a license) and then using an HTML Python library (like BeautifulSoup) to extract the PDF text, which will include some information such as whether the text is part of a heading, body, paragraph, etc. Other approaches may include converting to TXT or other file types before processing. See the notebook '20_Working_with_WebData_and_APIs' for an example on working with PDF files.

# Python Built-In Functions for Loading Data (OPTIONAL)
[Return to Table of Contents](#Table-of-Contents)

Python has various built-in functions to work with files. This section discusses the open function, which opens a file and returns the corresponding file object. The function has various parameters, including the file path and the mode. The mode specifies how the file is opened (e.g., for reading or writing). The following are the options for the mode:
- 'r': open for reading (default)
- 'w': open for writing, truncating the file first
- 'x': open for exclusive creation, failing if the file already exists
- 'a': open for writing, appending to the end of file if it exists
- 'b': binary mode
- 't': text mode (default)
- '+' open for updating (reading and writing)

Documentation References:
- List of Python Built-in Functions https://docs.python.org/3/library/functions.html
- Open Function: https://docs.python.org/3/library/functions.html#open
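The difference between the 'x', 'a', and 'w' modes can be seen in a short experiment. This is a minimal sketch run in a temporary folder with a hypothetical file name (`modes_demo.txt`).

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes_demo.txt")

    # 'x' creates the file and fails if it already exists.
    with open(path, "x") as f:
        f.write("created\n")
    try:
        with open(path, "x"):
            pass
        second_create_failed = False
    except FileExistsError:
        second_create_failed = True

    # 'a' keeps the existing content and adds to the end.
    with open(path, "a") as f:
        f.write("appended\n")
    with open(path, "r") as f:
        after_append = f.read()

    # 'w' truncates the file before writing.
    with open(path, "w") as f:
        f.write("overwritten\n")
    with open(path, "r") as f:
        after_write = f.read()

print(second_create_failed)
print(repr(after_append))
print(repr(after_write))
```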

## Open Function
[Return to Table of Contents](#Table-of-Contents)

In [46]:
f = open(file = "./input_data/bread.txt", mode = 'r')
for line in f:
    print(line)
f.close()

				Spent Grain Bread



	I got this recipe from the 1985 Grain Brewing Issue of Zymurgy magazine.  I tried it as is, and wit bananas added and found it to make a good heavy bread.  The original article is by Clifford T. Newmn Jr. and he requests any original recipies to be sent to him at P.O. Box 193, Port Matilda, PA  1680



	-4 C. fresh spent grains (from your latest batch of 

	-1 C. water               all grain beer)

	-1/2 C. oil

	-1/2 C. sugar

	-1/4 tsp. salt

	-1 Tbsp. dry baker's yeast

	-All-purpose flour (enough to make a stiff dough)



	Place the spent grains and water into a blender or food processor and blend themn for 30 seconds. Ten put the blended grains in a large mixing bowl and add oil, sugar, salt and stir in the yeast. Addflour until you have a thick, workable dough. Put in a warm place to rise until doubled in size. The knead the dough and divide into three greased laof pans. Let the dough double in size again then bae in a preheated oven at 350 degrees for 

In [47]:
# Open the file at output_data/anotherfile.txt for writing. Mode 'w' deletes any existing content first.

a = []
for i in range(10):
    a.append("All work and no play makes Jack a dull boy. \nSecond Line. \n")
f = open("./output_data/anotherfile.txt", 'w')
for line in a:
    f.write(line)
f.close()

## With Statement
[Return to Table of Contents](#Table-of-Contents)

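The "with" keyword sets up a context manager, which runs setup code when the block is entered and cleanup code when the block is left. For a file returned by the open function, the cleanup step closes the file automatically when the block is left, even if an error occurs inside it.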

Documentation References:
- With Statement: https://docs.python.org/3/reference/compound_stmts.html#the-with-statement
- Compound statements: https://docs.python.org/3/reference/compound_stmts.html
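
The automatic closing described above can be checked through the file object's `closed` attribute. This is a minimal sketch, assuming a scratch file named `with_demo.txt` in a temporary directory (both hypothetical); the `ValueError` is raised deliberately to simulate a failure inside the block.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

with open(path, "w") as f:
    f.write("hello\n")
    closed_inside = f.closed   # False: the file is still open inside the block

closed_after = f.closed        # True: the file was closed when the block ended

# The file is also closed if the block exits because of an exception.
try:
    with open(path) as g:
        raise ValueError("simulated failure while reading")
except ValueError:
    closed_after_error = g.closed

print(closed_inside, closed_after, closed_after_error)
```

With a plain `open` and `close` pair, the same guarantee would require wrapping the work in `try` and `finally`.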

In [None]:
# Same as before but using the bread.txt file with a list comprehension, printing only the first four lines.
lines = []

with open('./input_data/bread.txt') as f:
    lines = [line for line in f]
print(lines[:4])

In [None]:
type(lines) # Still returns a list object.

In [None]:
# To remove the trailing newline character (\n), use .rstrip() inside the list comprehension.
# To remove the leading tab character (\t), use .lstrip() instead.
# To remove both, use .strip(), as done here.
lines = []

with open('./input_data/bread.txt') as f:
    lines = [line.strip() for line in f]

print(lines[:4])
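
To see the three stripping methods side by side, here is a small sketch on a single string. The `sample` value is a hypothetical line shaped like the indented lines in bread.txt; `repr` is used so that the tab and newline characters stay visible in the output.

```python
# Hypothetical line, similar in shape to the indented lines in bread.txt.
sample = "\t-1/2 C. sugar\n"

right = sample.rstrip()             # trailing whitespace removed, leading tab kept
left = sample.lstrip()              # leading whitespace removed, trailing newline kept
both = sample.strip()               # whitespace removed from both ends
newline_only = sample.rstrip("\n")  # only the trailing newline removed

print(repr(right))
print(repr(left))
print(repr(both))
print(repr(newline_only))
```

Passing an argument, as in `rstrip("\n")`, limits what is removed to the given characters, which helps when other trailing whitespace is meaningful.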

## Example with Numerical Data in a TXT File
[Return to Table of Contents](#Table-of-Contents)

In [None]:
with open("./input_data/ex1data1.txt") as f:
    for line in f:
        print(line)

In [None]:
type(line) # Note that the line variable is a string.

In [None]:
lines = [] # Initializing an empty list.

with open("./input_data/ex1data1.txt") as f:
    for line in f:
        lines.append(line)
print(lines)

In [None]:
type(lines) # Note that in this case lines is a list, so each line can be accessed by index.

In this case, each line holds two columns of data, so we need a different approach.

We can use the split function to separate the numbers at the comma, and we also need to convert each string to a number (a float in this case). The result could later be loaded into a Pandas dataframe.

The next cell shows how to process the data in this example.

In [None]:
data = []

with open("./input_data/ex1data1.txt") as f:
    for line in f:
        parsed_line = line.split(",") # Line or row of data in this example.
        data_line = []
        for element in parsed_line:
            data_line.append(float(element)) # Convert each string to a float.
        data.append(data_line)
print(data)

# This produces a list of lists, where each inner list holds the two columns of one row and the outer list holds the rows.

In [None]:
type(data)

In [None]:
len(data) # Think number of rows.

In [None]:
len(data[0]) # Think number of columns.

In [None]:
type(data[0]) # Row of data.

In [None]:
print(type(data[0][0])) # Element within row of data.
print(data[0][0])
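
The same parsing can be written more compactly with a nested list comprehension, and `zip` can regroup the row-oriented result by column. This is a sketch under stated assumptions: the `io.StringIO` object stands in for a two-column text file, and its values are made up for illustration rather than taken from ex1data1.txt.

```python
import io

# Hypothetical stand-in for a two-column, comma-separated text file.
sample_file = io.StringIO("6.1,17.5\n5.5,9.1\n8.5,13.6\n")

# One inner list of floats per line; float() ignores the trailing newline.
rows = [[float(value) for value in line.split(",")] for line in sample_file]

# zip(*rows) regroups the values by column instead of by row.
first_column, second_column = (list(column) for column in zip(*rows))

print(rows)
print(first_column)
print(second_column)
```

Column-oriented lists are convenient when each column is a separate variable, for example when plotting one against the other.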

# JSON Library: JSON Files (OPTIONAL) 
[Return to Table of Contents](#Table-of-Contents)

In previous lectures we explored how to convert a dictionary (i.e., JSON) to a Pandas dataframe and vice versa. This section provides further examples of how to work with JSON data using the JSON library.

In [None]:
# Writing JSON to a FILE
data = {}  
data['people'] = []  
data['people'].append({  
    'name': 'Scott',
    'website': 'umbc.edu',
    'from': 'Maryland'
})
data['people'].append({  
    'name': 'Larry',
    'website': 'google.com',
    'from': 'Michigan'
})
data['people'].append({  
    'name': 'Tim',
    'website': 'apple.com',
    'from': 'Alabama'
})
#!mkdir outputs
with open('./output_data/json_data.txt', 'w') as outfile:
    json.dump(data, outfile)

In [None]:
# Reading JSON from a TXT File
with open('./output_data/json_data.txt') as json_file:  
    data = json.load(json_file)
    for p in data['people']:
        print('Name: ' + p['name'])
        print('Website: ' + p['website'])
        print('From: ' + p['from'])
        print('')

In [None]:
type(data)
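
The cells above use `json.dump` and `json.load`, which work with file objects. The library also provides `json.dumps` and `json.loads`, which work with strings. The following sketch shows a round trip in memory; the `record` dictionary is a hypothetical example shaped like the entries written above.

```python
import json

# Hypothetical record, shaped like the entries written above.
record = {"people": [{"name": "Scott", "website": "umbc.edu", "from": "Maryland"}]}

text = json.dumps(record, indent=2)   # dict -> JSON-formatted string
restored = json.loads(text)           # JSON-formatted string -> dict

print(text)
print(type(text), type(restored))
print(restored == record)
```

The `indent` argument only affects how the string is laid out; the data recovered by `json.loads` is the same with or without it.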

In [None]:
# JSON Read
with open('./input_data/geo_data.json') as f:
    data = json.load(f)
print(data)
print('____________________________________')
print(data["features"])
print('____________________________________')
print(data["features"][0]["geometry"])
print('____________________________________')
for i in data["features"]:
    print(i["geometry"]["coordinates"][0])

# Working with CSV Files (CSV Library)
[Return to Table of Contents](#Table-of-Contents)

Previously we learned to work with CSV data using the Pandas library, which is the preferred approach. However, the CSV library also exists and can be useful.

In [None]:
f = open('./input_data/csv_plain_file.csv', newline='') 
reader = csv.reader(f, quoting = csv.QUOTE_NONNUMERIC) # Unquoted fields are converted to floats.
for row in reader: # Each row is a list of values.
    for value in row: # Each value in the row.
        print(value) # Floats
f.close() # Don't close the file until you are done with the reader.
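
To make the effect of `csv.QUOTE_NONNUMERIC` concrete, the sketch below reads a small in-memory CSV. The content of `sample_csv` is hypothetical and is not the content of csv_plain_file.csv. With this setting the reader converts every unquoted field to a float and leaves quoted fields as strings.

```python
import csv
import io

# Hypothetical CSV content: quoted fields stay strings, unquoted fields become floats.
sample_csv = io.StringIO('"temperature",21.5,3\n"humidity",40,0.25\n')

reader = csv.reader(sample_csv, quoting=csv.QUOTE_NONNUMERIC)
rows = [row for row in reader]

print(rows)
```

Note that the integers 3 and 40 come back as the floats 3.0 and 40.0, and that an unquoted field that is not a number would raise a `ValueError`.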

In [None]:
with open('./input_data/addresses.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    firstnames, lastnames, streets, citys, states,zipcodes = [], [], [],[], [], []
        
    for row in readCSV:
        firstname, lastname, street = row[0], row[1], row[2] 
        city, state, zipcode  = row[3], row[4], row[5]  

        firstnames.append(firstname)
        lastnames.append(lastname)
        zipcodes.append(zipcode)

    print(firstnames)
    print(lastnames)
    print(zipcodes)

In [None]:
# Let's create a new CSV file from an existing one.
# The new file will contain only the information that we want/need.
f2 = open('./output_data/dataout1.csv', 'w', newline='') 
writer = csv.writer(f2, delimiter=',')

with open('./input_data/addresses.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    firstnames, lastnames, streets, citys, states,zipcodes = [], [], [],[], [], []
        
    for row in readCSV:
        firstname, lastname, street = row[0], row[1], row[2] 
        city, state, zipcode  = row[3], row[4], row[5]
        data2write = (firstname,lastname,state)
        print(data2write)
        writer.writerow(data2write)
        
    f2.close()
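
An alternative to opening the output file separately and closing it by hand is to open both files in a single with statement, so that both are closed when the block ends. This is a sketch under stated assumptions: the file names `addresses_demo.csv` and `dataout_demo.csv` and the two address rows are made up for illustration, with the same column order used above (first name, last name, street, city, state, zip code).

```python
import csv
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "addresses_demo.csv")   # hypothetical input file
dst_path = os.path.join(tmp_dir, "dataout_demo.csv")     # hypothetical output file

# Made-up rows: first name, last name, street, city, state, zip code.
with open(src_path, "w", newline="") as f:
    csv.writer(f).writerows([
        ["Ada", "Smith", "1 Main St", "Baltimore", "MD", "21201"],
        ["Bo", "Jones", "2 Oak Ave", "Dover", "DE", "19901"],
    ])

# Both files are opened by one with statement and both are closed on exit.
with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow((row[0], row[1], row[4]))  # first name, last name, state

# Read the output back to confirm what was written.
with open(dst_path, newline="") as f:
    written = list(csv.reader(f))

print(written)
```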

In [None]:
# Let's read csv and write txt

with open('./input_data/addresses.csv') as csvfile:
    f2 = open("./output_data/textout.txt", "w")

    readCSV = csv.reader(csvfile, delimiter=',')
    firstnames, lastnames, streets, citys, states,zipcodes = [], [], [],[], [], []
        
    for row in readCSV:
        firstname, lastname, street = row[0], row[1], row[2] 
        city, state, zipcode  = row[3], row[4], row[5]
        text2write = firstname+' lives in '+ city +', '+state
        print(text2write)
        f2.write(text2write+'\n')
        
    f2.close()

# Reading Multiple TXT Files (OPTIONAL)
[Return to Table of Contents](#Table-of-Contents)

In [None]:
# With Fileinput library
a = ["./input_data/text_files/example3.txt", 
     "./input_data/text_files/Genescan.txt", 
     "./input_data/text_files/FAME.txt"]
b = fileinput.input(a)
for line in b:
    print(b.filename())
    print(line)
b.close()
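
`fileinput.input` can also be used in a with statement, and the returned object reports the line number within the current file through `filelineno()`. The sketch below is self-contained: it first creates two small text files in a temporary directory (the names `a.txt` and `b.txt` and their contents are hypothetical) and then reads them as one stream.

```python
import fileinput
import os
import tempfile

# Create two hypothetical text files to read back.
tmp_dir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "alpha\nbeta\n"), ("b.txt", "gamma\n")]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

# Read all files as a single stream, recording where each line came from.
records = []
with fileinput.input(files=paths) as stream:
    for line in stream:
        records.append((os.path.basename(stream.filename()),
                        stream.filelineno(),
                        line.rstrip("\n")))

for record in records:
    print(record)
```

Recording the file name and line number next to each line keeps the origin of the text available after the files have been combined.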

# NOTEBOOK END