<a href="https://colab.research.google.com/github/baharkarami/Text-Mining-Class/blob/main/working_with_file.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with Files

The two calls below are equivalent: `open()` defaults to read mode (`"r"`) and text mode (`"t"`), so passing `"rt"` explicitly changes nothing.

**`"rt"`** stands for **"read, text"**, meaning the file is opened in text mode for reading:

In [1]:
f = open("demofile.txt")

In [2]:
f = open("demofile.txt", "rt")
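
As a quick check of this equivalence, the following sketch writes a throwaway file and reads it back in default, `"rt"`, and (for contrast) `"rb"` modes. The temporary path and the sample text are assumptions made for illustration and are not part of the notebook's files.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on demofile.txt.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Hello!")

with open(path, encoding="utf-8") as f:        # mode defaults to "r" (text)
    default_text = f.read()
with open(path, "rt", encoding="utf-8") as f:  # explicit "read, text"
    explicit_text = f.read()
with open(path, "rb") as f:                    # binary mode, for contrast
    raw_bytes = f.read()

print(default_text == explicit_text)  # the two text-mode reads match
print(type(default_text).__name__, type(raw_bytes).__name__)
```

Text mode returns `str` objects, while binary mode returns `bytes`.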



---



This code opens the file `demofile.txt` in read mode (`"r"`) and uses the **`read()`** method to read and display its content:

In [3]:
f = open("demofile.txt", "r")
print(f.read())

Hello! Welcome to demofile.txt
This file is for testing purposes.
Good Luck!




---



This code opens the file `demofile.txt` in read mode (`"r"`) and reads only the first 5 characters of the file using `read(5)`, then displays them:

In [4]:
f = open("demofile.txt", "r")
print(f.read(5))

Hello
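
A detail worth seeing in action: each `read(n)` call continues from where the previous one stopped. The sketch below uses `io.StringIO` as a stand-in for an opened text file (an assumption made so the example is self-contained); a real file object behaves the same way.

```python
import io

# io.StringIO stands in for an opened text file so the sketch is self-contained.
f = io.StringIO("Hello! Welcome to demofile.txt\n")

first = f.read(5)   # first five characters
second = f.read(1)  # continues from position 5, not from the start
rest = f.read()     # everything that remains
f.seek(0)           # move back to the beginning
again = f.read(5)

print(first, second, again)
```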




---



This code opens the file `demofile.txt` in append mode (`"a"`), adds the text `"Now the file has more content!"` to the end of the file, and then closes the file. Because `write()` does not add a newline, the new text continues directly after the existing last line.

The **`f.close()`** statement closes the opened file. Closing releases the file handle and flushes any buffered data to disk, so the written text is guaranteed to be saved.

After that, the file is reopened in read mode (`"r"`) and its content is displayed:

In [5]:
f = open("demofile.txt", "a")
f.write("Now the file has more content!")
f.close()

# Reopen the file and read it after appending:
f = open("demofile.txt", "r")
print(f.read())
f.close()

Hello! Welcome to demofile.txt
This file is for testing purposes.
Good Luck!Now the file has more content!
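
An alternative to calling `close()` by hand is the `with` statement, which closes the file automatically. The sketch below repeats the append-then-read pattern on a scratch file (the temporary path is hypothetical) and adds an explicit `\n` so the appended text starts on its own line.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical scratch file

with open(path, "w") as f:
    f.write("Good Luck!")

# The file is closed automatically when the with block ends,
# even if an exception is raised inside it.
with open(path, "a") as f:
    f.write("\nNow the file has more content!")  # leading \n starts a new line

with open(path, "r") as f:
    content = f.read()

print(content)
print(f.closed)
```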




---



This code opens the file `demofile.txt` in write mode (`"w"`), which discards its previous content, writes the text `"Woops! I have deleted the content!"`, and then closes the file.
After that, the file is reopened in read mode (`"r"`) and its new content is displayed:

In [6]:
f = open("demofile.txt", "w")
f.write("Woops! I have deleted the content!")
f.close()

# Reopen the file and read it after overwriting:
f = open("demofile.txt", "r")
print(f.read())
f.close()

Woops! I have deleted the content!




---



This code uses the **`os`** module to delete the file `demofile.txt`. After executing this command, the file is removed. If the file does not exist, `os.remove()` raises `FileNotFoundError`.

In [7]:
import os
os.remove("demofile.txt")



---



This code first checks if the file `demofile.txt` exists. If the file exists, it deletes it. Otherwise, it prints the message `"The file does not exist"`.

In [8]:
import os
if os.path.exists("demofile.txt"):
  os.remove("demofile.txt")
else:
  print("The file does not exist")

The file does not exist
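
Checking first and deleting afterwards leaves a small window in which another process could remove the file between the two steps. One possible alternative, sketched below on a hypothetical path that is never created, is to attempt the removal and handle the error; `pathlib.Path.unlink(missing_ok=True)` (Python 3.8+) does the same in one call.

```python
import os
import tempfile
from pathlib import Path

path = os.path.join(tempfile.mkdtemp(), "missing.txt")  # never created

# Attempt the removal and handle the failure instead of checking first.
try:
    os.remove(path)
    removed = True
except FileNotFoundError:
    removed = False
    print("The file does not exist")

# pathlib equivalent (Python 3.8+): missing_ok=True suppresses the error.
Path(path).unlink(missing_ok=True)
```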




---




This code opens the file `sample-file.txt` in append mode (`'a'`) with `utf-8` encoding, adds the text `"\nسلام"` (a newline followed by the Persian word for "hello"), and then closes the file.

`encoding='utf-8'` specifies that the file should be read or written using the UTF-8 encoding. UTF-8 can represent every Unicode character, including Persian letters.

Then, the file is reopened in read mode (`'r'`), its content is stored in the variable `txt`, and then printed.

In [9]:
f = open("sample-file.txt", mode='a', encoding='utf-8')
f.write("\nسلام")
f.close()

f = open("sample-file.txt", mode='r', encoding='utf-8')
txt = f.read()
print(txt)


سلام
سلام
سلام
سلام
سلام
سلام
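
To make the role of the encoding more concrete, the following sketch encodes the same word to bytes and decodes it again. It works on an in-memory string rather than on `sample-file.txt`, and the `latin-1` decoding is included only to illustrate what a mismatched encoding produces.

```python
text = "سلام"  # four Persian letters

utf8_bytes = text.encode("utf-8")
print(len(text))        # number of characters
print(len(utf8_bytes))  # number of bytes; each of these letters takes two

# Decoding with the same encoding restores the original string.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == text)

# Decoding with a different encoding does not fail here, but yields garbled text.
garbled = utf8_bytes.decode("latin-1")
print(garbled == text)
```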




---



1. Reading the entire file content: In the first section, the file `sample-file.txt` is opened in read mode (`'r'`) with `utf-8` encoding, and **`readlines()`** returns all lines of the file as a list, which is printed.

2. Reading the file line by line - Method 1: In the second section, the file is reopened and lines are read one by one using a `while` loop and the `:=` operator (walrus operator, available from Python 3.8), and printed.


 - `line := f.readline()` assigns the line just read to the variable `line`.

 - `while (line := f.readline())` continues as long as the line is non-empty; `readline()` returns an empty string at the end of the file, which ends the loop.

3. Reading the file line by line - Method 2: In the third section, the file is reopened and read line by line using a `for` loop, which automatically iterates over each line in the file and prints it.

Each line keeps its trailing `\n`, and `print()` adds another one, which is why the line-by-line output shows a blank line after each line.

In [10]:
print("read total file:")
f = open("sample-file.txt", mode='r', encoding='utf-8')
print(f.readlines())

print("read file line by line  -1:")
f = open("sample-file.txt", mode='r', encoding='utf-8')
while(line := f.readline()):
    print(line)

print("read file line by line  -2:")
f = open("sample-file.txt", mode='r', encoding='utf-8')
for line in f:
    print(line)

read total file:
['\n', 'سلام\n', 'سلام\n', 'سلام\n', 'سلام\n', 'سلام\n', 'سلام']
read file line by line  -1:


سلام

سلام

سلام

سلام

سلام

سلام
read file line by line  -2:


سلام

سلام

سلام

سلام

سلام

سلام
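
The blank lines in the output come from the newline that each line already carries. A minimal sketch of one way to avoid them is shown below; it uses `io.StringIO` with made-up contents as a stand-in for the opened file.

```python
import io

# io.StringIO stands in for the opened file; the contents are illustrative.
f = io.StringIO("first\nsecond\nthird")

lines = []
for line in f:
    lines.append(line.rstrip("\n"))  # drop the trailing newline, if any

print(lines)
```

Calling `print(line, end="")` inside the loop is another option when the lines only need to be displayed.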




---



The `sample.json` file is opened, loaded into a dictionary using `json.load()`, and printed.

Iterating over a dictionary yields its keys, so the first loop prints each key.

In the second loop, keys and their values are printed using two formatting methods (`%`-formatting with `%s`, and an f-string).

In [11]:
import json
with open('sample.json', 'r') as f:
    data = json.load(f)
    print(data)

print("\n-----------------------------\n")

for i in data:
    print(i)

print("\n-----------------------------\n")

for i in data:
    print("%s = %s" % (i, data[i]))
    print(f"{i} = {data[i]}")

{'firstname': 'Bahar', 'lastname': 'Karami', 'age': 21}

-----------------------------

firstname
lastname
age

-----------------------------

firstname = Bahar
firstname = Bahar
lastname = Karami
lastname = Karami
age = 21
age = 21
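
As a variation on the second loop, `dict.items()` yields each key together with its value, which avoids the repeated `data[i]` lookup. The sketch below parses the same content from a string with `json.loads()` so that it does not depend on `sample.json` being present.

```python
import json

# json.loads parses a string; json.load (used above) parses an open file.
data = json.loads('{"firstname": "Bahar", "lastname": "Karami", "age": 21}')

pairs = []
for key, value in data.items():  # yields (key, value) tuples in insertion order
    pairs.append(f"{key} = {value}")
    print(f"{key} = {value}")
```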




---




Saving data to a simple JSON file:

  - The file **`data.json`** is opened in **write mode**.

  - The **`json.dump()`** method writes the dictionary's data into the file.

  - The **`ensure_ascii=False`** parameter writes **non-ASCII characters as-is** instead of escaping them as `\uXXXX` sequences.

Saving data to a pretty-formatted JSON file:

  - The file **`data.jsonl`** is opened in **write mode**. Note that the `.jsonl` extension conventionally denotes JSON Lines (one JSON value per line); the indented output written here is ordinary JSON.

  - Data is written with indentation using the **`json.dump()`** method.

  - The **`indent=4`** parameter specifies that each level of **indentation will use 4 spaces**, making the file more readable.

In [12]:
import json

data = {
    "firstname" : "Alice",
    "lastname" : "Hall",
    "age" : "35"
}

with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)

with open('data.jsonl', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
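
The effect of these two parameters is easier to see on a value that actually contains non-ASCII text. The sketch below uses `json.dumps()`, which returns the JSON as a string instead of writing a file; the `record` dictionary is illustrative data, not taken from the notebook.

```python
import json

record = {"greeting": "سلام"}  # illustrative data, not from the notebook

escaped = json.dumps(record)                       # ensure_ascii defaults to True
readable = json.dumps(record, ensure_ascii=False)  # characters written as-is
pretty = json.dumps(record, ensure_ascii=False, indent=4)

print(escaped)
print(readable)
print(pretty)

# Both forms decode to the same Python object.
print(json.loads(escaped) == json.loads(readable))
```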



---



This code uses the Pandas library to create a data table (DataFrame):

A dictionary named `mydataset` is defined with two keys.

The **`pd.DataFrame()`** method converts the dictionary into a tabular data structure (DataFrame).

The variable `myvar`, which contains the DataFrame, is printed.

In [13]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2
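
Once the DataFrame exists, a few attributes give a quick overview of it. The following is a small sketch that reuses the same `mydataset` dictionary; the choice of `shape`, `columns`, and a column sum is only one way to inspect the table.

```python
import pandas as pd

mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)

print(myvar.shape)              # (number of rows, number of columns)
print(list(myvar.columns))      # column labels come from the dictionary keys
print(myvar['passings'].sum())  # selecting one column gives a Series
```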




---



This code uses the Pandas library to create a Series:

A list named `a` containing three numbers is defined.

The **`pd.Series()`** method converts the list into a Series.

- A Series in Pandas is a one-dimensional array with values and an associated index.

- By default, the indices start from 0.

The variable `myvar`, which holds the Series, is printed.

In [14]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64




---



The **`index`** parameter is used to set custom indices ("x", "y", and "z").

In [15]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

x    1
y    7
z    2
dtype: int64
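
With custom labels in place, a value can be looked up by its label or by its position. A minimal sketch, reusing the same Series:

```python
import pandas as pd

a = [1, 7, 2]
myvar = pd.Series(a, index=["x", "y", "z"])

by_label = myvar["y"]          # lookup by index label
by_label_loc = myvar.loc["y"]  # explicit label-based lookup
by_position = myvar.iloc[1]    # explicit position-based lookup

print(by_label, by_label_loc, by_position)
```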




---



The **`pd.Series()`** method converts the dictionary into a Series.

The dictionary keys become the indices, and the values become the data.

In [16]:
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

day1    420
day2    380
day3    390
dtype: int64
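
When a dictionary is combined with the `index` parameter, only the keys listed in `index` are included in the Series. The sketch below illustrates this with the same `calories` dictionary; the variable name `subset` is chosen here for illustration.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# Only the keys named in index are pulled from the dictionary.
subset = pd.Series(calories, index=["day1", "day2"])

print(subset)
```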




---



The `data` dictionary contains two keys:
- `calories`: A list of calorie values (420, 380, 390).
- `duration`: A list of duration values (50, 40, 45).

Create a DataFrame:

The **`pd.DataFrame()`** method converts the dictionary into a DataFrame.
- Each key becomes a column, and the values of each key become the data for that column.

The variable `df`, which holds the DataFrame, is printed.

In [17]:
import pandas as pd

data = {
    "calories" : [420, 380, 390],
    "duration" : [50, 40, 45]
}

df = pd.DataFrame(data)

print(df)

   calories  duration
0       420        50
1       380        40
2       390        45




---




The `loc` indexer is used to access rows by index label.

- **`df.loc[[0, 1]]`** selects the rows labeled 0 and 1. Passing a list of labels returns a DataFrame.

In [18]:
import pandas as pd

data = {
    "calories" : [420, 380, 390],
    "duration" : [50, 40, 45]
}

df = pd.DataFrame(data)

print(df.loc[[0, 1]])

   calories  duration
0       420        50
1       380        40
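
The type of the result depends on whether `loc` receives a single label or a list of labels. A short sketch of the difference, using the same data:

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)

row_as_series = df.loc[0]    # single label: the row comes back as a Series
rows_as_frame = df.loc[[0]]  # list of labels: the result stays a DataFrame

print(type(row_as_series).__name__)
print(type(rows_as_frame).__name__)
```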




---



The `data` dictionary contains two keys, `calories` and `duration`, which are used as columns in the DataFrame.

The `index` parameter in `pd.DataFrame()` assigns custom indices ("day1", "day2", "day3") to the rows.

Select rows with custom indices:

The `loc` indexer is used to select rows by their index labels:
- `df.loc[["day1", "day2"]]` selects rows `"day1"` and `"day2"`.

In [19]:
import pandas as pd

data = {
    "calories" : [420, 380, 390],
    "duration" : [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df.loc[["day1", "day2"]])

      calories  duration
day1       420        50
day2       380        40
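
Rows with custom labels can still be reached by position through `iloc`. The sketch below contrasts the two indexers on the same DataFrame; note that a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

by_label = df.loc["day2"]              # label-based lookup
by_position = df.iloc[1]               # position-based lookup of the same row

label_slice = df.loc["day1":"day2"]    # end label is included
position_slice = df.iloc[0:1]          # end position is excluded

print(by_label.equals(by_position))
print(len(label_slice), len(position_slice))
```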




---



**`pd.read_csv('data.csv')`** reads the data from `data.csv` into a DataFrame.

**`df.to_string()`** renders the entire DataFrame as a single text string.

- For a long DataFrame, this bypasses the truncated display that `print(df)` uses by default, so all rows and columns are shown.

In [20]:
import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string())

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   
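
The difference between the default display and `to_string()` can be checked on a frame that is long enough to be truncated. The sketch below builds an illustrative 100-row DataFrame instead of reading `data.csv`, and it assumes the pandas display options are at their defaults.

```python
import pandas as pd

# Illustrative 100-row frame; the notebook's data.csv is not needed here.
df = pd.DataFrame({"value": range(100)})

default_view = repr(df)     # what print(df) shows; truncated for long frames
full_view = df.to_string()  # every row, regardless of display settings

print(pd.options.display.max_rows)  # row limit used by the default display
print(len(default_view.splitlines()), len(full_view.splitlines()))
```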



---




**`pd.read_json('data1.json')`** reads the file `data1.json` and converts it into a DataFrame.
By default, the JSON is expected to be an object that maps each column name to its values, similar to a nested Python dictionary.

**`df.to_string()`** converts the entire content of the DataFrame into a string.

In [21]:
import pandas as pd

df = pd.read_json('data1.json')

print(df.to_string())

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300
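
To show the kind of structure `read_json()` accepts, the sketch below parses a small JSON document from memory. The JSON text is illustrative and is not the content of `data1.json`; `io.StringIO` is used so that no file is needed.

```python
import io
import pandas as pd

# Illustrative JSON in the column orientation:
# {column name: {row label: value}}. Not the contents of data1.json.
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 117}}'

df = pd.read_json(io.StringIO(json_text))

print(df)
```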




---



This code creates a DataFrame in Pandas, saves it as a JSON file with a specific orientation, reads the JSON file back into a DataFrame, and prints the content:


1. The data contains three rows and three columns, defined as a 2D list.
- Custom indices are set as `row 1`, `row 2`, and `row 3`.
- Columns are named `col1`, `col2`, and `col3`.


2. **`to_json()`** saves the DataFrame to the file `file.json`.
- The parameter **`orient='split'`** specifies a split-oriented format, which includes:
  - columns: Column names
  - index: Row indices
  - data: Actual data
- The parameter **`compression='infer'`** (the default) chooses the compression from the file extension; a `.json` path is written uncompressed.

3. **`read_json()`** reads the file `file.json` and converts it back into a DataFrame.
- The `orient` argument must match the one used when writing (`orient='split'`).

4. The contents of the file, reloaded as a DataFrame, are printed.

In [22]:
import pandas as pd

df = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
                  index=['row 1', 'row 2', 'row 3'],
                  columns=['col1', 'col2', 'col3'])

df.to_json('file.json', orient='split', compression='infer', index=True)

df = pd.read_json('file.json', orient='split', compression='infer')

print(df)

      col1 col2 col3
row 1    a    b    c
row 2    d    e    f
row 3    g    h    i
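
To see what the split orientation looks like, the JSON text can be inspected directly. The sketch below uses a smaller 2x2 frame for brevity and relies on `to_json()` returning a string when no path is given.

```python
import json
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['c', 'd']],
                  index=['row 1', 'row 2'],
                  columns=['col1', 'col2'])

# With no path argument, to_json() returns the JSON text instead of writing a file.
split_text = df.to_json(orient='split')
parsed = json.loads(split_text)

print(sorted(parsed))   # top-level keys of the split layout
print(parsed['data'])   # row values, one inner list per row
```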
