# File I/O
Reading and writing files.

## Working with paths

In [1]:
from pathlib import Path
!wget --no-check-certificate 'https://ia600508.us.archive.org/34/items/data_20231009/data.zip' -O data.zip

--2023-10-22 05:51:07--  https://ia600508.us.archive.org/34/items/data_20231009/data.zip
Resolving ia600508.us.archive.org (ia600508.us.archive.org)... 207.241.227.188
Connecting to ia600508.us.archive.org (ia600508.us.archive.org)|207.241.227.188|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 261631 (255K) [application/zip]
Saving to: ‘data.zip’


2023-10-22 05:51:07 (1.28 MB/s) - ‘data.zip’ saved [261631/261631]



In [2]:
!unzip data
!rm -r __MACOSX/
!rm data.zip

Archive:  data.zip
   creating: data/
  inflating: __MACOSX/._data         
  inflating: data/london_weather.csv  
  inflating: __MACOSX/data/._london_weather.csv  
  inflating: data/supermarket_sales - Sheet1.csv  
  inflating: __MACOSX/data/._supermarket_sales - Sheet1.csv  


In [3]:
# Get the absolute path of this notebook
current_file = Path("10_file_io.ipynb").resolve()

# Get the directory that contains it
current_dir = current_file.parent


### Exercise
Try to navigate to the data folder.

In [4]:
# The path to the data directory MUST be stored in data_dir
data_dir = current_dir / 'data'

## Path methods

*   `path.exists()`   : check if the path exists
*   `path.is_file()`  : check if the path points to a file
*   `path.is_dir()`   : check if the path points to a directory

All these methods will return a boolean value.
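
The three methods above can be compared side by side. The following is a minimal sketch that builds a temporary directory, so it does not depend on the downloaded data; the names `demo_dir`, `example.csv` and `missing.csv` are made up for illustration.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory so the example does not touch the notebook's files.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "data"          # hypothetical folder name
    demo_dir.mkdir()
    demo_file = demo_dir / "example.csv"   # hypothetical file name
    demo_file.write_text("a,b\n1,2\n")
    missing = demo_dir / "missing.csv"     # never created on purpose

    # Record (exists, is_file, is_dir) for each path while the directory still exists.
    checks = {
        "dir": (demo_dir.exists(), demo_dir.is_file(), demo_dir.is_dir()),
        "file": (demo_file.exists(), demo_file.is_file(), demo_file.is_dir()),
        "missing": (missing.exists(), missing.is_file(), missing.is_dir()),
    }

for name, (exists, is_file, is_dir) in checks.items():
    print(f"{name}: exists={exists}, is_file={is_file}, is_dir={is_dir}")
```

Note that a path that does not exist is neither a file nor a directory, so all three methods return `False` for it.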




In [5]:

print(f"exists: {data_dir.exists()}")
# Try the other methods yourself.


exists: True


### Navigate to `london_weather.csv`
Create `data_path` and store the full path of `london_weather.csv` in it.

In [6]:
# Build the path from data_dir so it works outside of /content as well
data_path = data_dir / 'london_weather.csv'

### Reading files

In [7]:
# Store every line of the file in a dictionary, keyed by its line number
data = {}
with open(data_path) as example:
    for i, line in enumerate(example):
        data[i] = line

In [9]:
print(data)

{0: 'date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,min_temp,precipitation,pressure,snow_depth\n', 1: '19790101,2.0,7.0,52.0,2.3,-4.1,-7.5,0.4,101900.0,9.0\n', 2: '19790102,6.0,1.7,27.0,1.6,-2.6,-7.5,0.0,102530.0,8.0\n', 3: '19790103,5.0,0.0,13.0,1.3,-2.8,-7.2,0.0,102050.0,4.0\n', 4: '19790104,8.0,0.0,13.0,-0.3,-2.6,-6.5,0.0,100840.0,2.0\n', 5: '19790105,6.0,2.0,29.0,5.6,-0.8,-1.4,0.0,102250.0,1.0\n', 6: '19790106,5.0,3.8,39.0,8.3,-0.5,-6.6,0.7,102780.0,1.0\n', 7: '19790107,8.0,0.0,13.0,8.5,1.5,-5.3,5.2,102520.0,0.0\n', 8: '19790108,8.0,0.1,15.0,5.8,6.9,5.3,0.8,101870.0,0.0\n', 9: '19790109,4.0,5.8,50.0,5.2,3.7,1.6,7.2,101170.0,0.0\n', 10: '19790110,7.0,1.9,30.0,4.9,3.3,1.4,2.1,98700.0,0.0\n', 11: '19790111,1.0,6.8,55.0,2.9,2.6,0.3,2.3,98960.0,0.0\n', 12: '19790112,3.0,6.4,54.0,2.0,0.4,-2.0,0.0,100650.0,1.0\n', 13: '19790113,1.0,7.0,57.0,4.3,-2.6,-7.1,0.0,102350.0,1.0\n', 14: '19790114,7.0,0.0,14.0,6.7,-0.6,-5.6,0.8,102700.0,1.0\n', 15: '19790115,,0.0,15.0,5.9,3.8,1.0,0.
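
The loop above keeps each line as one raw string, including the trailing `\n`. As a hedged alternative, the standard-library `csv` module can split each line into named fields. The sketch below uses three lines copied from the output above, wrapped in `io.StringIO` as a stand-in for the open file.

```python
import csv
import io

# Three lines copied from the output above; io.StringIO stands in for the open file.
sample = io.StringIO(
    "date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,"
    "min_temp,precipitation,pressure,snow_depth\n"
    "19790101,2.0,7.0,52.0,2.3,-4.1,-7.5,0.4,101900.0,9.0\n"
    "19790102,6.0,1.7,27.0,1.6,-2.6,-7.5,0.0,102530.0,8.0\n"
)

# DictReader uses the first line as the header and yields one dict per data row.
rows = list(csv.DictReader(sample))

print(len(rows))
# Every value is still a string; convert explicitly when a number is needed.
print(rows[0]["date"], float(rows[0]["min_temp"]))
```

With a real file, the same reader can be built from the file object inside the `with open(...)` block.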

## Writing files

In [10]:
new_file_path = data_dir / "new_file.txt"

with open(new_file_path, "w") as my_file:
    my_file.write("This is my first file that I wrote with Python.")

Now check that there is a `new_file.txt` in the data directory. After that, you can delete the file with:

In [11]:
if new_file_path.exists():  # make sure it's there
  new_file_path.unlink()
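
The `"w"` mode used above overwrites a file if it already exists. The following sketch contrasts it with the append mode `"a"`; it writes to a temporary directory, and the file name `demo.txt` is only an illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.txt"  # hypothetical file name

    # "w" creates the file, or empties it first if it already exists.
    with open(demo_path, "w") as f:
        f.write("first line\n")

    # "a" keeps the existing content and adds to the end.
    with open(demo_path, "a") as f:
        f.write("second line\n")
    after_append = demo_path.read_text()

    # Opening with "w" again discards everything written so far.
    with open(demo_path, "w") as f:
        f.write("replaced\n")
    after_overwrite = demo_path.read_text()

print(repr(after_append))
print(repr(after_overwrite))
```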

## Importing Pandas package


*   Why do we need packages in Python?

> A typical Python program is made up of several source files. Each source file is a module, grouping code and data for reuse. Modules are normally independent of each other, so that other programs can reuse the specific modules they need. Sometimes, to manage complexity, developers group together related modules into a package—a hierarchical, tree-like structure of related modules and subpackages.

*   What is pandas?

> pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.






In [12]:
import pandas as pd

## Reading data files

In [13]:
weather = pd.read_csv(data_path)
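
By default, `read_csv` reads a column such as `date` (stored as `YYYYMMDD`) as plain integers. The following is a small sketch, assuming that real dates are wanted, of converting such a column with `pd.to_datetime`. It uses a two-row sample in `io.StringIO` instead of the real file, and the name `df` is chosen here for illustration.

```python
import io

import pandas as pd

# Two rows in the same YYYYMMDD layout as london_weather.csv (reduced to three columns).
csv_text = (
    "date,max_temp,min_temp\n"
    "19790101,2.3,-7.5\n"
    "19790102,1.6,-7.5\n"
)

df = pd.read_csv(io.StringIO(csv_text))
print(df["date"].dtype)  # an integer dtype, not a date

# Convert to timestamps; the format string matches YYYYMMDD.
df["date"] = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")
print(df["date"].dt.year.tolist())
```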

### Using Pandas methods to manipulate data

Dataframe head and tail:

*  `DataFrame.head()` : return the first rows (five by default)
*  `DataFrame.tail()` : return the last rows (five by default)



In [14]:
weather.head()

Unnamed: 0,date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,min_temp,precipitation,pressure,snow_depth
0,19790101,2.0,7.0,52.0,2.3,-4.1,-7.5,0.4,101900.0,9.0
1,19790102,6.0,1.7,27.0,1.6,-2.6,-7.5,0.0,102530.0,8.0
2,19790103,5.0,0.0,13.0,1.3,-2.8,-7.2,0.0,102050.0,4.0
3,19790104,8.0,0.0,13.0,-0.3,-2.6,-6.5,0.0,100840.0,2.0
4,19790105,6.0,2.0,29.0,5.6,-0.8,-1.4,0.0,102250.0,1.0


In [15]:
# Try tail() yourself
weather.tail()

Unnamed: 0,date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,min_temp,precipitation,pressure,snow_depth
15336,20201227,1.0,0.9,32.0,7.5,7.5,7.6,2.0,98000.0,
15337,20201228,7.0,3.7,38.0,3.6,1.1,-1.3,0.2,97370.0,
15338,20201229,7.0,0.0,21.0,4.1,2.6,1.1,0.0,98830.0,
15339,20201230,6.0,0.4,22.0,5.6,2.7,-0.1,0.0,100200.0,
15340,20201231,7.0,1.3,34.0,1.5,-0.8,-3.1,0.0,100500.0,


### `DataFrame.info()`
`DataFrame.info()` prints a concise summary of the dataframe: the index range, each column's name, non-null count and dtype, and the memory usage. It is handy for a quick overview during exploratory analysis.

In [16]:
# show the summary of weather dataframe.
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15341 entries, 0 to 15340
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date              15341 non-null  int64  
 1   cloud_cover       15322 non-null  float64
 2   sunshine          15341 non-null  float64
 3   global_radiation  15322 non-null  float64
 4   max_temp          15335 non-null  float64
 5   mean_temp         15305 non-null  float64
 6   min_temp          15339 non-null  float64
 7   precipitation     15335 non-null  float64
 8   pressure          15337 non-null  float64
 9   snow_depth        13900 non-null  float64
dtypes: float64(9), int64(1)
memory usage: 1.2 MB
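
The non-null counts above show that some columns have missing values. One way to count them directly, sketched here on a tiny hand-made dataframe rather than on `weather`, is `isna().sum()`.

```python
import pandas as pd

# Tiny hand-made frame; None marks a missing value.
df = pd.DataFrame(
    {
        "min_temp": [-7.5, -7.5, None],
        "snow_depth": [9.0, None, None],
    }
)

# isna() gives True for every missing cell; sum() counts the Trues per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

For each column, this count plus the non-null count reported by `info()` adds up to the number of rows.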


### `DataFrame.shape`
`shape` is an attribute, not a method, so it is accessed without parentheses. It returns a tuple with the size of each dimension: (rows, columns) for a dataframe.

In [18]:
# show the shape of weather
weather.shape

(15341, 10)
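
Because `shape` is a tuple, it can be unpacked into separate variables. A minimal sketch on a made-up dataframe:

```python
import pandas as pd

# Made-up frame with 3 rows and 2 columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Unpack the (rows, columns) tuple.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```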

### `DataFrame.describe()`
It computes a quick statistical summary of every numerical column.

In [19]:
# show quick statistical summary of weather
weather.describe()

Unnamed: 0,date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,min_temp,precipitation,pressure,snow_depth
count,15341.0,15322.0,15341.0,15322.0,15335.0,15305.0,15339.0,15335.0,15337.0,13900.0
mean,19995670.0,5.268242,4.350238,118.756951,15.388777,11.475511,7.559867,1.668634,101536.605594,0.037986
std,121217.6,2.070072,4.028339,88.898272,6.554754,5.729709,5.326756,3.73854,1049.722604,0.545633
min,19790100.0,0.0,0.0,8.0,-6.2,-7.6,-11.8,0.0,95960.0,0.0
25%,19890700.0,4.0,0.5,41.0,10.5,7.0,3.5,0.0,100920.0,0.0
50%,20000100.0,6.0,3.5,95.0,15.0,11.4,7.8,0.0,101620.0,0.0
75%,20100700.0,7.0,7.2,186.0,20.3,16.0,11.8,1.6,102240.0,0.0
max,20201230.0,9.0,16.0,402.0,37.9,29.0,22.3,61.8,104820.0,22.0


### `nsmallest()` and `nlargest()`
They return the n rows with the smallest or largest values in a given column.

In [20]:
weather.nsmallest(5,'min_temp')

Unnamed: 0,date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,min_temp,precipitation,pressure,snow_depth
1077,19811213,7.0,0.0,12.0,6.6,-3.6,-11.8,15.7,99610.0,20.0
1109,19820114,0.0,4.6,47.0,5.1,-4.2,-10.1,0.0,102620.0,11.0
2597,19860210,0.0,7.4,86.0,1.1,-4.4,-9.6,0.0,103150.0,3.0
11676,20101220,7.0,0.4,17.0,1.8,-4.1,-9.4,0.0,100120.0,5.0
2934,19870113,5.0,0.4,20.0,-1.4,-6.2,-9.1,3.0,101380.0,1.0
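
As a rough illustration of what `nsmallest()` does, the sketch below compares it with sorting the column and taking the first rows. The four rows are taken from the output above, in shuffled order, and the variable names are chosen here for illustration.

```python
import pandas as pd

# Four rows from the output above, deliberately out of order.
df = pd.DataFrame(
    {
        "date": [19860210, 19811213, 19870113, 19820114],
        "min_temp": [-9.6, -11.8, -9.1, -10.1],
    }
)

coldest = df.nsmallest(2, "min_temp")
# Same rows, obtained by sorting in ascending order and keeping the first two.
via_sort = df.sort_values("min_temp").head(2)

print(coldest)
print(coldest.equals(via_sort))
```

The two approaches agree here because there are no tied values. With ties, `nsmallest()` has a `keep` parameter that controls which rows are returned.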


### Exercise


> Show the five days with the greatest snow depth in London.



In [21]:
weather.nlargest(5,'snow_depth')

Unnamed: 0,date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,min_temp,precipitation,pressure,snow_depth
1076,19811212,1.0,5.8,46.0,4.5,-4.4,-8.4,0.0,100670.0,22.0
1077,19811213,7.0,0.0,12.0,6.6,-3.6,-11.8,15.7,99610.0,20.0
1105,19820110,5.0,2.8,35.0,1.7,-3.6,-6.3,0.0,101110.0,18.0
1104,19820109,8.0,0.0,14.0,-0.8,-2.6,-3.6,1.9,101450.0,16.0
1106,19820111,4.0,7.0,56.0,2.4,-1.8,-5.3,0.0,100750.0,15.0


### `DataFrame.query()`
We sometimes need to filter a dataframe based on a condition, or apply a mask to get certain values. One easy way to filter a dataframe is the `query()` method, which takes the condition as a string.

In [22]:
weather.query('-3 < min_temp < 0')

Unnamed: 0,date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,min_temp,precipitation,pressure,snow_depth
4,19790105,6.0,2.0,29.0,5.6,-0.8,-1.4,0.0,102250.0,1.0
11,19790112,3.0,6.4,54.0,2.0,0.4,-2.0,0.0,100650.0,1.0
17,19790118,8.0,0.0,15.0,3.0,0.8,-0.2,0.2,101860.0,0.0
18,19790119,8.0,0.0,16.0,7.2,0.8,-1.4,5.2,100910.0,0.0
19,19790120,7.0,0.0,16.0,3.5,3.1,-1.0,0.0,100920.0,0.0
...,...,...,...,...,...,...,...,...,...,...
15306,20201127,7.0,0.8,34.0,6.5,2.9,-0.6,0.2,101970.0,
15316,20201207,8.0,1.0,24.0,2.2,0.9,-0.3,3.0,99940.0,
15335,20201226,,2.1,38.0,10.0,4.9,-0.1,12.0,101960.0,
15337,20201228,7.0,3.7,38.0,3.6,1.1,-1.3,0.2,97370.0,
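
The query above can also be written as a boolean mask. The sketch below checks on a small made-up dataframe that both forms select the same rows; `via_query` and `via_mask` are names chosen here for illustration.

```python
import pandas as pd

# Made-up temperatures; 0.0 sits exactly on the upper bound and must be excluded.
df = pd.DataFrame({"min_temp": [-7.5, -1.4, -2.0, 5.3, 0.0]})

# Filter with a query string.
via_query = df.query("-3 < min_temp < 0")

# Same filter as an explicit mask; each comparison needs its own parentheses.
mask = (df["min_temp"] > -3) & (df["min_temp"] < 0)
via_mask = df[mask]

print(via_query)
print(via_query.equals(via_mask))
```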


### Exercise

> Show the dates where the minimum temperature was less than -11 and the snow depth was more than 10.



In [37]:
weather.query('min_temp < -11 and snow_depth > 10')

Unnamed: 0,date,cloud_cover,sunshine,global_radiation,max_temp,mean_temp,min_temp,precipitation,pressure,snow_depth
1077,19811213,7.0,0.0,12.0,6.6,-3.6,-11.8,15.7,99610.0,20.0
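
The thresholds in a query string do not have to be hard-coded. As a hedged sketch, `query()` can refer to Python variables by prefixing them with `@`. The three rows below are taken from the outputs above, and `temp_limit` and `snow_limit` are names made up for this example.

```python
import pandas as pd

# Three rows taken from the outputs above (reduced to three columns).
df = pd.DataFrame(
    {
        "date": [19811212, 19811213, 19820114],
        "min_temp": [-8.4, -11.8, -10.1],
        "snow_depth": [22.0, 20.0, 11.0],
    }
)

temp_limit = -11   # hypothetical variable names
snow_limit = 10

# @name looks up a Python variable from the surrounding scope.
result = df.query("min_temp < @temp_limit and snow_depth > @snow_limit")
print(result["date"].tolist())
```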
