
In [80]:
import pandas as pd  # Importing pandas for data manipulation


### Reading and Writing Data in Text Format


##### Pandas provides functions to read and write structured text data such as CSV, TSV, and JSON.

##### **Writing Data**
- **CSV Format**:  
  `df.to_csv("output.csv", index=False)`
- **TSV Format**:  
  `df.to_csv("output.tsv", sep="\t", index=False)`
- **JSON Format**:  
  `df.to_json("output.json", orient="records")`

##### **Reading Data**
- **CSV File**:  
  `df = pd.read_csv("output.csv")`
- **TSV File**:  
  `df = pd.read_csv("output.tsv", sep="\t")`
- **JSON File**:  
  `df = pd.read_json("output.json")`
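
The calls listed above can be combined into a round trip. The following is a minimal sketch, assuming a small hypothetical DataFrame and a temporary directory; the data and the temporary location are illustrative and are not part of the notebook's example files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame(
    {"a": [1, 5, 9], "b": [2, 6, 10], "message": ["hello", "world", "foo"]}
)

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Write the same frame in three text formats
    df.to_csv(tmp / "output.csv", index=False)
    df.to_csv(tmp / "output.tsv", sep="\t", index=False)
    df.to_json(tmp / "output.json", orient="records")

    # Read each file back into a new DataFrame
    from_csv = pd.read_csv(tmp / "output.csv")
    from_tsv = pd.read_csv(tmp / "output.tsv", sep="\t")
    from_json = pd.read_json(tmp / "output.json")

print(from_csv)
print(from_tsv)
print(from_json)
```

Writing with `index=False` keeps the row labels out of the file, so reading it back does not produce an extra unnamed column.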



#### Examples 4.1



*   ex1



In [8]:
# Define the URL of the CSV file hosted on GitHub
url_ex_1 = 'https://raw.githubusercontent.com/davidofitaly/notes_03_python_in_data_analysis/main/files_and_folders/examples/ex1.csv'

# Read the CSV file into a DataFrame
df_ex_1 = pd.read_csv(url_ex_1)

# Display the DataFrame
df_ex_1

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo




*  ex2




In [10]:
# Define the URL of the CSV file hosted on GitHub
url_ex_2 = 'https://raw.githubusercontent.com/davidofitaly/notes_03_python_in_data_analysis/main/files_and_folders/examples/ex2.csv'

# Read the CSV file into a DataFrame
df_ex_2 = pd.read_csv(url_ex_2)

# Display the DataFrame
df_ex_2

Unnamed: 0,1,2,3,4,hello
0,5,6,7,8,world
1,9,10,11,12,foo


In [12]:
pd.read_csv(url_ex_2, header=None) # Reads a CSV file from the given URL (url_ex_2) without considering any row as the header

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [18]:
pd.read_csv(url_ex_2, names=['A', 'B', 'C', 'D', 'Message']) # Reads a CSV file from the given URL (url_ex_2) and assigns custom column names

Unnamed: 0,A,B,C,D,Message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [20]:
names = ['A', 'B', 'C', 'D', 'Message']
pd.read_csv(url_ex_2, names=names, index_col='Message') # Reads a CSV file from the given URL (url_ex_2) and sets the 'Message' column as the index

Unnamed: 0_level_0,A,B,C,D
Message,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12




*   csv_mindex



In [22]:
# Define the URL of the CSV file hosted on GitHub
url_mindex = 'https://raw.githubusercontent.com/davidofitaly/notes_03_python_in_data_analysis/main/files_and_folders/examples/csv_mindex.csv'

# Read the CSV file into a DataFrame
df_mindex = pd.read_csv(url_mindex)

# Display the DataFrame
df_mindex

Unnamed: 0,key1,key2,value1,value2
0,one,a,1,2
1,one,b,3,4
2,one,c,5,6
3,one,d,7,8
4,two,a,9,10
5,two,b,11,12
6,two,c,13,14
7,two,d,15,16


In [23]:
pd.read_csv(url_mindex, index_col=['key1', 'key2']) # Reads a CSV file from the given URL (url_mindex) and sets multiple columns as the index

Unnamed: 0_level_0,Unnamed: 1_level_0,value1,value2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16




*   ex3



In [29]:
# Define the URL of the whitespace-delimited text file hosted on GitHub
url_ex_3 = 'https://raw.githubusercontent.com/davidofitaly/notes_03_python_in_data_analysis/main/files_and_folders/examples/ex3.txt'

# Read the text file into a DataFrame (read_table splits on tabs by default,
# so each line of this file ends up in a single column)
df_ex_3 = pd.read_table(url_ex_3)

# Display the DataFrame
df_ex_3

Unnamed: 0,A B C
0,aaa -0.264438 -1.026059 -0.619500
1,bbb 0.927272 0.302904 -0.032399
2,ccc -0.264273 -0.386314 -0.217601
3,ddd -0.871858 -0.348382 1.100491


In [30]:
pd.read_table(url_ex_3, sep=r'\s+') # Reads the text file, treating any run of whitespace as the field delimiter

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491




*   ex4



In [33]:
# Define the URL of the CSV file hosted on GitHub
url_ex_4 = 'https://raw.githubusercontent.com/davidofitaly/notes_03_python_in_data_analysis/main/files_and_folders/examples/ex4.csv'

# Read the CSV file into a DataFrame
df_ex_4 = pd.read_csv(url_ex_4)

# Display the DataFrame
df_ex_4

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,# Cześć!
a,b,c,d,message
# Chciałem tylko trochę utrudnić Twoją pracę.,,,,
# Kto w ogóle wczytuje pliki CSV za pomocą komputera?,,,,
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [35]:
pd.read_csv(url_ex_4, skiprows=[0, 2, 3]) # Reads a CSV file from the given URL (url_ex_4), skipping lines 0, 2, and 3 (the comment lines)

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo




*   ex5


In [36]:
# Define the URL of the CSV file hosted on GitHub
url_ex_5 = 'https://raw.githubusercontent.com/davidofitaly/notes_03_python_in_data_analysis/main/files_and_folders/examples/ex5.csv'

# Read the CSV file into a DataFrame
df_ex_5 = pd.read_csv(url_ex_5)

# Display the DataFrame
df_ex_5

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [37]:
pd.isnull(df_ex_5) # Checks for missing values in the DataFrame

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


In [40]:
pd.read_csv(url_ex_5, na_values=['NULL']) # Reads a CSV file from the given URL (url_ex_5) and treats 'NULL' values as missing data (NaN)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


### Reading a Part of a Text File in Pandas

##### Pandas allows loading only a part of a text file, which is useful when working with large datasets that cannot fit into memory.

##### **Reading a Specific Number of Rows**
- **First n rows**:  
  To read only the first `n` rows of the file, use the `nrows` argument.

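##### **Skipping Rows**
- **Skip rows**:  
  To skip the first `n` lines of the file (counting from the top, header line included), pass an integer to `skiprows`. To skip specific lines, pass a list of 0-based line numbers, as in `skiprows=[0, 2, 3]`.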

##### **Reading Specific Columns**
- **Specifying columns to read**:  
  To read only selected columns, use the `usecols` argument.

##### **Reading in Chunks**
- **Loading data in chunks**:  
  To read the data in chunks, use the `chunksize` argument. This allows processing large files without loading everything into memory at once.
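
The `nrows`, `skiprows`, and `usecols` arguments can be sketched on a small in-memory file. The CSV text below is a hypothetical stand-in for a large file on disk, and the column names are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk
keys = ["A", "B"]
rows = [f"{i},{i * 2},{keys[i % 2]}" for i in range(10)]
csv_text = "one,two,key\n" + "\n".join(rows) + "\n"

# nrows: read only the first 3 data rows
first_rows = pd.read_csv(io.StringIO(csv_text), nrows=3)

# skiprows with a list: skip file lines 1-4 (line 0 is the header, so it is kept)
skipped = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2, 3, 4])

# usecols: load only the named columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=["one", "key"])

print(first_rows)
print(skipped)
print(subset.columns.tolist())
```

Passing a list to `skiprows` keeps the header line, whereas an integer would skip it along with the first data rows.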



#### Examples 4.2

In [53]:
pd.options.display.max_rows = 8



*   ex6


In [54]:
# Define the URL of the CSV file hosted on GitHub
url_ex_6 = 'https://raw.githubusercontent.com/davidofitaly/notes_03_python_in_data_analysis/main/files_and_folders/examples/ex6.csv'

# Read the CSV file into a DataFrame
df_ex_6 = pd.read_csv(url_ex_6)

# Display the DataFrame
df_ex_6

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
...,...,...,...,...,...
9996,-0.479893,-0.650419,0.745152,-0.646038,E
9997,0.523331,0.787112,0.486066,1.093156,K
9998,-0.362559,0.598894,-1.843201,0.887292,G
9999,-0.096376,-1.012999,-0.657431,-0.573315,0


In [56]:
pd.read_csv(url_ex_6, nrows=7) # Reads a CSV file from the given URL (url_ex_6) and loads only the first 7 rows

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
5,1.81748,0.742273,0.419395,-2.251035,Q
6,-0.776764,0.935518,-0.332872,-1.875641,U
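
The `chunksize` argument is not exercised in the cells above. As a minimal sketch, assuming a small hypothetical in-memory CSV instead of `ex6.csv`, the following iterates over a file in pieces and accumulates how often each value of a `key` column occurs.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file
keys = ["A", "B", "C"]
rows = [f"{i},{keys[i % 3]}" for i in range(12)]
csv_text = "value,key\n" + "\n".join(rows) + "\n"

counts = Counter()
chunk_sizes = []

# With chunksize set, read_csv returns an iterator of DataFrames
with pd.read_csv(io.StringIO(csv_text), chunksize=5) as reader:
    for chunk in reader:
        chunk_sizes.append(len(chunk))
        # Add this chunk's per-key counts to the running total
        counts.update(chunk["key"].value_counts().to_dict())

print(chunk_sizes)
print(dict(counts))
```

Only one chunk is held in memory at a time, so the same pattern applies to files that are too large to load in full.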


### Saving Data in Text Format in Pandas


Pandas offers functions to save DataFrames in text formats such as CSV, TSV, and JSON.

- **`to_csv()`**: Saves data in CSV format. You can exclude the index, set a custom delimiter (e.g., tab for TSV), and handle missing values.
  
- **`to_json()`**: Saves data in JSON format, with customizable structure using the `orient` parameter.

- **Common Options**:  
  - `index=False`: Excludes the index.  
  - `header=False`: Excludes column names.  
  - `na_rep='NA'`: Replaces missing values with a string.  
  - `columns=["col1", "col2"]`: Saves selected columns.
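
The effect of these options is easiest to see on the generated text. The sketch below assumes a small hypothetical DataFrame with one missing value and relies on `to_csv()` returning the CSV text as a string when no path is given.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value
df = pd.DataFrame(
    {"a": [1, 5], "b": [2.0, np.nan], "message": ["hello", "world"]}
)

# With no path argument, to_csv returns the CSV text instead of writing a file
default_text = df.to_csv()
no_index_no_header = df.to_csv(index=False, header=False)
with_na_rep = df.to_csv(index=False, na_rep="NA")
two_columns = df.to_csv(index=False, columns=["a", "message"])

for text in (default_text, no_index_no_header, with_na_rep, two_columns):
    print(text)
```

By default the index is written as an unnamed first column and missing values become empty fields.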


#### Examples 4.3



*   ex5


In [57]:
# Define the URL of the CSV file hosted on GitHub
url_ex_5 = 'https://raw.githubusercontent.com/davidofitaly/notes_03_python_in_data_analysis/main/files_and_folders/examples/ex5.csv'

# Read the CSV file into a DataFrame
df_ex_5 = pd.read_csv(url_ex_5)

# Display the DataFrame
df_ex_5

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [59]:
df_ex_5.to_csv("ex5_out_1.csv") # Saves the DataFrame (df_ex_5) to a CSV file named "ex5_out_1.csv"

In [60]:
df_ex_5.to_csv("ex5_out_2.csv", index=False, header=False) # Saves the DataFrame (df_ex_5) to "ex5_out_2.csv" without the index or the header row

In [61]:
df_ex_5.to_csv("ex5_out_3.csv", na_rep="NULL") # Saves the DataFrame (df_ex_5) to "ex5_out_3.csv" with missing values written as the string "NULL"

In [62]:
df_ex_5.to_csv("ex5_out_4.csv", index=False, columns=["a", "b", "c"]) # Saves only columns a, b, and c of df_ex_5 to "ex5_out_4.csv", without the index

### Data in JSON Format

##### JSON (JavaScript Object Notation) is a lightweight, flexible format for storing and exchanging data. It is commonly used for hierarchical or nested data structures.

##### Pandas provides the `to_json()` method to save DataFrames in JSON format.

- **Saving Data as JSON**:  
  The `orient` parameter of `to_json()` controls how the DataFrame's rows, columns, and index are laid out in the JSON output.

- **Common `orient` Options**:  
  - `records`: A list with one object per row, `[{column: value}, ...]`; the index is not written.
  - `columns`: One object per column, `{column: {index: value}}`. This is the default for a DataFrame.
  - `index`: One object per index label, `{index: {column: value}}`.
  
- **Reading JSON Files**:  
  To load data from a JSON file into a DataFrame, use the `read_json()` function.
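
To compare the `orient` options side by side, the sketch below serializes one small hypothetical DataFrame three ways and reads one of the results back. It assumes that `to_json()` returns a string when no path is given; the data and index labels are illustrative.

```python
import io
import json

import pandas as pd

# Hypothetical frame with string index labels so the index is visible in the output
df = pd.DataFrame({"a": [1, 4], "b": [2, 5]}, index=["x", "y"])

# With no path argument, to_json returns the JSON text
as_records = df.to_json(orient="records")
as_columns = df.to_json(orient="columns")
as_index = df.to_json(orient="index")

print(as_records)
print(as_columns)
print(as_index)

# read_json expects a path or file-like object, so wrap the string in StringIO
restored = pd.read_json(io.StringIO(as_columns))
print(restored)
```

The `records` layout drops the index labels, so it suits data whose index carries no information.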



#### Examples 4.4

In [64]:
# URL pointing to the example JSON file
url_json = 'https://raw.githubusercontent.com/davidofitaly/notes_03_python_in_data_analysis/main/files_and_folders/examples/example.json'

# Loading the JSON file into a DataFrame
df_json = pd.read_json(url_json)

# Displaying the DataFrame to check its contents
df_json

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


### XML and HTML - Web Scraping with `read_html`

##### The `read_html()` function in Pandas reads HTML tables from a webpage or HTML file. It finds every `<table>` element it can parse and returns a list of DataFrames, one per table.

- **Usage**: It can read tables from both local HTML files and URLs.
- **Limitations**: Works best with well-structured HTML tables; complex or heavily formatted tables may not be parsed correctly.
- **Additional Options**: You can adjust parameters like `header` and `attrs` to fine-tune table extraction.

`read_html()` simplifies extracting structured data from HTML tables for analysis.


#### Examples 4.5

In [81]:
# URL pointing to the example HTML file
url_html = 'https://raw.githubusercontent.com/davidofitaly/notes_03_python_in_data_analysis/main/files_and_folders/examples/fdic_failed_bank_list.html'

# read_html returns a list of DataFrames, one per table found in the page
df_html = pd.read_html(url_html)

# Take the first DataFrame in the list
df_html_first_table = df_html[0]

# Set display options before displaying the DataFrame
pd.set_option('display.max_rows', 20)  # Adjust the number of rows shown
pd.set_option('display.max_columns', 10)  # Adjust the number of columns shown

# Display the DataFrame (only the last expression in a cell is rendered)
df_html_first_table


Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Allied Bank,Mulberry,AR,91,Today's Bank,"September 23, 2016","November 17, 2016"
1,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,"August 19, 2016","November 17, 2016"
2,First CornerStone Bank,King of Prussia,PA,35312,First-Citizens Bank & Trust Company,"May 6, 2016","September 6, 2016"
3,Trust Company Bank,Memphis,TN,9956,The Bank of Fayette County,"April 29, 2016","September 6, 2016"
4,North Milwaukee State Bank,Milwaukee,WI,20364,First-Citizens Bank & Trust Company,"March 11, 2016","June 16, 2016"
...,...,...,...,...,...,...,...
542,"Superior Bank, FSB",Hinsdale,IL,32646,"Superior Federal, FSB","July 27, 2001","August 19, 2014"
543,Malta National Bank,Malta,OH,6629,North Valley Bank,"May 3, 2001","November 18, 2002"
544,First Alliance Bank & Trust Co.,Manchester,NH,34264,Southern New Hampshire Bank & Trust,"February 2, 2001","February 18, 2003"
545,National State Bank of Metropolis,Metropolis,IL,3815,Banterra Bank of Marion,"December 14, 2000","March 17, 2005"


### Data Formats - Binary Formats (focus on `read_pickle`)



##### Binary formats in Python, such as **Pickle**, allow for efficient storage and retrieval of Python objects. These formats serialize Python objects into binary data, preserving their structure and data types. The most common use case is storing and loading complex Python objects like DataFrames, lists, or dictionaries.

- **Pickle Format**:
  - Pickle is Python's built-in binary serialization format.
  - It saves Python objects to a file and loads them back later without losing their structure or type information.
  - Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.

- **`to_pickle()` and `read_pickle()`**:
  - `DataFrame.to_pickle()` writes a DataFrame to a pickle file, and `pd.read_pickle()` loads it back.
  - This is typically used to save and reload a DataFrame efficiently without reprocessing the raw data.
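
The type-preserving behaviour described above can be checked with a short round trip. This is a minimal sketch, assuming a hypothetical DataFrame with a datetime column and a named index, written to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame with a datetime column and a named index
df = pd.DataFrame(
    {"when": pd.to_datetime(["2024-01-01", "2024-01-02"]), "value": [1.5, 2.5]},
    index=pd.Index(["x", "y"], name="label"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "frame.pkl"
    df.to_pickle(path)               # serialize the DataFrame to a binary file
    restored = pd.read_pickle(path)  # load it back

print(restored.dtypes)
print(restored.index.name)
print(restored.equals(df))
```

Unlike a CSV round trip, the datetime column and the index name come back unchanged, with no need to pass arguments such as `parse_dates` or `index_col`.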



#### Examples 4.6

In [None]:
# Define the URL of the CSV file hosted on GitHub
url_ex_1 = 'https://raw.githubusercontent.com/davidofitaly/notes_03_python_in_data_analysis/main/files_and_folders/examples/ex1.csv'

# Read the CSV file into a DataFrame
df_ex_1 = pd.read_csv(url_ex_1)

# Display the DataFrame
df_ex_1

In [82]:
df_ex_1.to_pickle("df_ex_1_pickle") # Saves the DataFrame (df_ex_1) to a pickle file named "df_ex_1_pickle"

In [83]:
pd.read_pickle("df_ex_1_pickle") # Loads the DataFrame from the pickle file "df_ex_1_pickle"

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


### Reading Excel Files

##### The `pd.read_excel()` function in Pandas reads Excel files (`.xls`, `.xlsx`) into a DataFrame.

- **Key Features**:
  - Supports `.xls` and `.xlsx` formats.
  - Can read specific sheets or multiple sheets.
  - Allows customization for columns, headers, and missing values.

- **Common Arguments**:
  - `sheet_name`: Specifies which sheet to read.
  - `header`: Row(s) to use as column names.
  - `index_col`: Column to set as the DataFrame index.



#### Examples 4.7

In [93]:
# Define the URL of the Excel file hosted on GitHub
url_excel = 'https://raw.githubusercontent.com/davidofitaly/notes_03_python_in_data_analysis/main/files_and_folders/examples/ex1.xlsx'

# Read the Excel file into a DataFrame, using the first column as the index
df_excel = pd.read_excel(url_excel, index_col=0)

# Display the DataFrame
df_excel

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [94]:
df_excel.to_excel("df_excel_out.xlsx") # Saves the DataFrame (df_excel) to an Excel file named "df_excel_out.xlsx"