## Opening and reading different file types

Python can read and write many file types, such as .txt, .csv, and FITS.

Let's start by reading and writing .txt files.

Python includes the built-in `open()` function. Its second argument is the mode, which controls how the file is opened:
- 'r' read: opens the file for reading (the default); returns an error if the file is not found.
- 'a' append: opens the file to write at the end; existing content is not overwritten, and the file is created if not found.
- 'w' write: overwrites the existing file; creates a new file if not found.
- 'x' create: creates a new file; returns an error if the file already exists.
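
These modes can be tried out safely in a scratch directory. The following is a minimal sketch; the file name `modes_demo.txt` and the temporary directory are assumptions made for illustration and are not part of the lesson files.

```python
import os
import tempfile

# Work in a throwaway directory so no lesson files are touched (assumption)
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'modes_demo.txt')  # hypothetical file name

# 'w' creates the file (or overwrites it if it already exists)
f = open(path, 'w')
f.write('first line\n')
f.close()

# 'a' appends to the end without overwriting
f = open(path, 'a')
f.write('second line\n')
f.close()

# 'r' reads the file; it raises FileNotFoundError if the file is missing
f = open(path, 'r')
contents = f.read()
f.close()
print(contents)

# 'x' creates a new file and raises FileExistsError if it already exists
try:
    open(path, 'x')
    x_failed = False
except FileExistsError:
    x_failed = True
print(x_failed)
```

Because the file already exists by the time `'x'` is tried, the last `open()` call is expected to fail, which is the behaviour that distinguishes `'x'` from `'w'`.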

This lesson includes a .txt file named 'planetary_data.txt', which we will use for our example.

In [1]:
data = open('planetary_data.txt')

In [2]:
# Object wrapper
data

<_io.TextIOWrapper name='planetary_data.txt' mode='r' encoding='UTF-8'>

In [3]:
data.read()

"Mercury,\tVenus,\tEarth,\tMars,\tJupiter\t\nMean distance from Sun (millions of kilometers),\t57.9,\t108.2,\t149.6,\t227.9,\t778.3\n\t\t\t\t\t\nMean distance from Sun\t36\t67.24\t92.9\t141.71\t483.88\t\t\t\t\t\t\nPeriod of revolution\t88 days\t224.7 days\t365.2 days\t687 days\t11.86 yrs\nInclination of axis\tNear 0°\t3°\t23°27'\t25° 12'\t3° 5'\nInclination of orbit\t7°\t3.4°\t0°\t1.9°\t1.3°\t\t\t\t\t\nEccentricity of orbit\t0.206\t0.007\t0.017\t0.093\t0.048\nEquatorial diameter\t4.880,\t12.100,\t12.756,\t6.794,\t142.800\t\t\t\t\t\nAtmosphere (main components),\tVirtually none,\tCarbon dioxideNitrogenOxygen\tCarbon dioxide,\tHydrogen, helium\t"

In [4]:
# Read the first 10 characters
data = open('planetary_data.txt')
data.read(10)

'Mercury,\tV'

In [5]:
# Read a line
data = open('planetary_data.txt')
data.readline()

'Mercury,\tVenus,\tEarth,\tMars,\tJupiter\t\n'

In [6]:
# Read line by line
data = open('planetary_data.txt')
for x in data:
    print(x)

Mercury,	Venus,	Earth,	Mars,	Jupiter	

Mean distance from Sun (millions of kilometers),	57.9,	108.2,	149.6,	227.9,	778.3

					

Mean distance from Sun	36	67.24	92.9	141.71	483.88						

Period of revolution	88 days	224.7 days	365.2 days	687 days	11.86 yrs

Inclination of axis	Near 0°	3°	23°27'	25° 12'	3° 5'

Inclination of orbit	7°	3.4°	0°	1.9°	1.3°					

Eccentricity of orbit	0.206	0.007	0.017	0.093	0.048

Equatorial diameter	4.880,	12.100,	12.756,	6.794,	142.800					

Atmosphere (main components),	Virtually none,	Carbon dioxideNitrogenOxygen	Carbon dioxide,	Hydrogen, helium	


In [7]:
# Close the reference to the object
data.close()
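
Printing each line in the earlier loop leaves a blank line between rows, because every line already ends with a newline character and `print()` adds another. Below is a minimal sketch of one way around this, using a `with` block, which closes the file automatically. The file `sample_lines.txt` is a hypothetical stand-in created just for this example.

```python
import os
import tempfile

# Create a small stand-in file (hypothetical name and contents)
path = os.path.join(tempfile.mkdtemp(), 'sample_lines.txt')
with open(path, 'w') as f:
    f.write('Mercury\nVenus\nEarth\n')

lines = []
# The with block closes the file when the block ends, even if an error occurs
with open(path) as data:
    for line in data:
        clean = line.rstrip('\n')  # drop the trailing newline
        lines.append(clean)
        print(clean)

# The file object is already closed here, no data.close() needed
print(data.closed)
```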

Let's open a second .txt file that contains less text. Open the file called `planetary_data_2.txt`.

In [40]:
# Open our second txt file
planets = open('planetary_data_2.txt')
planets.read()

'Body, Radius (km), Mass (10e21 kg), Density (g/cm3), Gravity (m/s2)\nMercury, 2439.4, 60.83, 330.11, 5.4291, 3.70\nVenus,\t6052, 928.43, 4867.5, 5.243, 8.87\nEarth,\t6371.0084, 1083.21, 5972.4, 5.5136, 9.8\t\nMars, 3389.5, 163.18, 641.71, 3.9341, 3.71\nJupiter, 69911, 1431280, 1898187, 1.3262, 24.79'

Let's try to extract the radii of the planets using the `split()` string method. It splits a string wherever a given separator occurs; in this case we use a comma as the separator.

In [9]:
bodies = []
radii = []

planets = open('planetary_data_2.txt')

for line in planets:
    splitLine = line.split(',')
    print(splitLine) # print the split line (a list of strings)
    print(line.strip()) # print the line as a string, without surrounding whitespace
    bodies.append(splitLine[0])
    radii.append(splitLine[1])
#bodies

['Body', ' Radius (km)', ' Mass (10e21 kg)', ' Density (g/cm3)', ' Gravity (m/s2)\n']
Body, Radius (km), Mass (10e21 kg), Density (g/cm3), Gravity (m/s2)
['Mercury', ' 2439.4', ' 60.83', ' 330.11', ' 5.4291', ' 3.70\n']
Mercury, 2439.4, 60.83, 330.11, 5.4291, 3.70
['Venus', '\t6052', ' 928.43', ' 4867.5', ' 5.243', ' 8.87\n']
Venus,	6052, 928.43, 4867.5, 5.243, 8.87
['Earth', '\t6371.0084', ' 1083.21', ' 5972.4', ' 5.5136', ' 9.8\t\n']
Earth,	6371.0084, 1083.21, 5972.4, 5.5136, 9.8
['Mars', ' 3389.5', ' 163.18', ' 641.71', ' 3.9341', ' 3.71\n']
Mars, 3389.5, 163.18, 641.71, 3.9341, 3.71
['Jupiter', ' 69911', ' 1431280', ' 1898187', ' 1.3262', ' 24.79']
Jupiter, 69911, 1431280, 1898187, 1.3262, 24.79


In [10]:
bodies

['Body', 'Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter']

In [11]:
radii

[' Radius (km)', ' 2439.4', '\t6052', '\t6371.0084', ' 3389.5', ' 69911']
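
The values collected above are still strings with stray spaces and tabs, and the first entry is the column header. Below is a minimal sketch of one way to clean them. Here `io.StringIO` stands in for `open('planetary_data_2.txt')` so that the example is self-contained; the rows are copied from the file contents shown earlier.

```python
import io

# In-memory stand-in for the file; rows copied from planetary_data_2.txt
sample = io.StringIO(
    'Body, Radius (km), Mass (10e21 kg), Density (g/cm3), Gravity (m/s2)\n'
    'Mercury, 2439.4, 60.83, 330.11, 5.4291, 3.70\n'
    'Venus,\t6052, 928.43, 4867.5, 5.243, 8.87\n'
    'Earth,\t6371.0084, 1083.21, 5972.4, 5.5136, 9.8\t\n'
)

bodies = []
radii = []

next(sample)  # skip the header line
for line in sample:
    fields = line.split(',')
    bodies.append(fields[0].strip())        # remove surrounding whitespace
    radii.append(float(fields[1].strip()))  # convert the text to a number

print(bodies)
print(radii)
```

With the radii stored as floats, they can be used directly in calculations such as `max(radii)` or `sum(radii)`.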

### Short exercise
Add the mass of every planet found in the file `planetary_data_2.txt` to a list named `masses`. You can use a for loop similar to the code above.

In [13]:
masses = []

planets = open('planetary_data_2.txt')

for line in planets:
    splitLine = line.split(',')
    print(splitLine) # print the split line (a list of strings)
    print(line.strip()) # print the line as a string, without surrounding whitespace
    masses.append(splitLine[2])
    
masses

['Body', ' Radius (km)', ' Mass (10e21 kg)', ' Density (g/cm3)', ' Gravity (m/s2)\n']
Body, Radius (km), Mass (10e21 kg), Density (g/cm3), Gravity (m/s2)
['Mercury', ' 2439.4', ' 60.83', ' 330.11', ' 5.4291', ' 3.70\n']
Mercury, 2439.4, 60.83, 330.11, 5.4291, 3.70
['Venus', '\t6052', ' 928.43', ' 4867.5', ' 5.243', ' 8.87\n']
Venus,	6052, 928.43, 4867.5, 5.243, 8.87
['Earth', '\t6371.0084', ' 1083.21', ' 5972.4', ' 5.5136', ' 9.8\t\n']
Earth,	6371.0084, 1083.21, 5972.4, 5.5136, 9.8
['Mars', ' 3389.5', ' 163.18', ' 641.71', ' 3.9341', ' 3.71\n']
Mars, 3389.5, 163.18, 641.71, 3.9341, 3.71
['Jupiter', ' 69911', ' 1431280', ' 1898187', ' 1.3262', ' 24.79']
Jupiter, 69911, 1431280, 1898187, 1.3262, 24.79


[' Mass (10e21 kg)', ' 60.83', ' 928.43', ' 1083.21', ' 163.18', ' 1431280']

### Short exercise 2

- Read the data from `planetary_data_2.txt` and split it by planet. Create a new list for every planet and add it to a nested list called `list_planets`, so that this list holds one list per planet with that planet's data.
- Write a script that gets an integer from the user using `input()`.
- Return the planet from the list whose index is the number entered by the user.

In [45]:
planets = open('planetary_data_2.txt')
planets.read()

'Body, Radius (km), Mass (10e21 kg), Density (g/cm3), Gravity (m/s2)\nMercury, 2439.4, 60.83, 330.11, 5.4291, 3.70\nVenus,\t6052, 928.43, 4867.5, 5.243, 8.87\nEarth,\t6371.0084, 1083.21, 5972.4, 5.5136, 9.8\t\nMars, 3389.5, 163.18, 641.71, 3.9341, 3.71\nJupiter, 69911, 1431280, 1898187, 1.3262, 24.79'

In [29]:
planets = open('planetary_data_2.txt')

for x in planets:
    print(x)

Body, Radius (km), Mass (10e21 kg), Density (g/cm3), Gravity (m/s2)

Mercury, 2439.4, 60.83, 330.11, 5.4291, 3.70

Venus,	6052, 928.43, 4867.5, 5.243, 8.87

Earth,	6371.0084, 1083.21, 5972.4, 5.5136, 9.8	

Mars, 3389.5, 163.18, 641.71, 3.9341, 3.71

Jupiter, 69911, 1431280, 1898187, 1.3262, 24.79


In [59]:
list_planets = []

planets = open('planetary_data_2.txt')

for line in planets:
    splitLine = line.split(',')
    #print(splitLine) # print the splitted line
    #print(line.strip()) # print the stripped line as String
    print(line.strip())
    
    
#print(list_planets)

#user_input = int(input())

Body, Radius (km), Mass (10e21 kg), Density (g/cm3), Gravity (m/s2)
Mercury, 2439.4, 60.83, 330.11, 5.4291, 3.70
Venus,	6052, 928.43, 4867.5, 5.243, 8.87
Earth,	6371.0084, 1083.21, 5972.4, 5.5136, 9.8
Mars, 3389.5, 163.18, 641.71, 3.9341, 3.71
Jupiter, 69911, 1431280, 1898187, 1.3262, 24.79
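
One possible way to finish this exercise is sketched below; it is not the only solution. The helper name `get_planet` is hypothetical, the header row is skipped by choice, and `io.StringIO` stands in for the real file so the example runs on its own. In a notebook, the index could come from `int(input())` instead of the hard-coded value.

```python
import io

# In-memory stand-in for planetary_data_2.txt (first rows only)
sample = io.StringIO(
    'Body, Radius (km), Mass (10e21 kg), Density (g/cm3), Gravity (m/s2)\n'
    'Mercury, 2439.4, 60.83, 330.11, 5.4291, 3.70\n'
    'Venus,\t6052, 928.43, 4867.5, 5.243, 8.87\n'
)

list_planets = []
next(sample)  # skip the header so index 0 is the first planet
for line in sample:
    # One list per planet, with whitespace removed from every field
    planet = [field.strip() for field in line.split(',')]
    list_planets.append(planet)

def get_planet(index):
    """Return the planet at the given position, or None if out of range."""
    if 0 <= index < len(list_planets):
        return list_planets[index]
    return None

user_input = 1  # in a notebook: user_input = int(input())
print(get_planet(user_input))
```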


## Writing to an existing file
We can use the `'a'` and `'w'` modes to append to a file and to overwrite it.

dat = open('append.txt', 'a')
dat.write('Appending this to the end')
dat.close()
# Check your append.txt file after running this cell
# Remember to close any file that you open manually

In [4]:
dat = open('append.txt', 'r')
dat.read()

'Sobreescribir contenidosAppending this to the endAppending this to the end'
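
The output above shows the appended text running together, because `write()` does not add a line break on its own. Below is a minimal sketch of adding `\n` explicitly, assuming a scratch file named `notes_demo.txt` (hypothetical, created in a temporary directory).

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'notes_demo.txt')  # hypothetical file

dat = open(path, 'a')
dat.write('Appending this to the end\n')  # '\n' ends the line
dat.write('Appending a second line\n')
dat.close()

dat = open(path, 'r')
text = dat.read()
dat.close()

# Each write now sits on its own line
print(text.splitlines())
```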

In [5]:
# Overwrites the current file
datos = open('append.txt', 'w')
datos.write('Overwrite contents')

18

In [6]:
datos = open('append.txt', 'r')
datos.read()

'Overwrite contents'

In [7]:
datos.close()

### Short exercise
Create a script that writes the numbers from 1 to 100 to a text file, one number per line.

## Working with CSV files
The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases.

We can read CSV files with Python's standard library, without installing anything extra. First we need to import the built-in `csv` module using the `import` keyword.

In [31]:
# Import the csv module
import csv

with open('cereal.csv', newline='') as csvfile:
    # The fields in this file are separated by commas
    read = csv.reader(csvfile, delimiter=',')
    for row in read:
        print(', '.join(row))

name, mfr, type, calories, protein, fat, sodium, fiber, carbo, sugars, potass, vitamins, shelf, weight, cups, rating
100% Bran, N, C, 70, 4, 1, 130, 10, 5, 6, 280, 25, 3, 1, 0.33, 68.402973
100% Natural Bran, Q, C, 120, 3, 5, 15, 2, 8, 8, 135, 0, 3, 1, 1, 33.983679
All-Bran, K, C, 70, 4, 1, 260, 9, 7, 5, 320, 25, 3, 1, 0.33, 59.425505
All-Bran with Extra Fiber, K, C, 50, 4, 0, 140, 14, 8, 0, 330, 25, 3, 1, 0.5, 93.704912
Almond Delight, R, C, 110, 2, 2, 200, 1, 14, 8, -1, 25, 3, 1, 0.75, 34.384843
Apple Cinnamon Cheerios, G, C, 110, 2, 2, 180, 1.5, 10.5, 10, 70, 25, 1, 1, 0.75, 29.509541
Apple Jacks, K, C, 110, 2, 0, 125, 1, 11, 14, 30, 25, 2, 1, 1, 33.174094
Basic 4, G, C, 130, 3, 2, 210, 2, 18, 8, 100, 25, 3, 1.33, 0.75, 37.038562
Bran Chex, R, C, 90, 2, 1, 200, 4, 15, 6, 125, 25, 1, 1, 0.67, 49.120253
Bran Flakes, P, C, 90, 3, 0, 210, 5, 13, 5, 190, 25, 3, 1, 0.67, 53.313813
Cap'n'Crunch, Q, C, 120, 1, 2, 220, 0, 12, 12, 35, 25, 2, 1, 0.75, 18.042851
Cheerios, G, C, 110, 6, 2, 290, 2, 17, 1, 105, 25, 1, 1, 1.25, 50.764999
Cinnamon Toast Crunch, G, C, 120, 1, 3, 210, 0, 13, 9, 45, 25, 2, 1, 0.75, 19.823573
Clusters, G, C, 110, 3, 2, 140, 2, 13, 7, 105, 25, 3, 1, 0.5, 40.400208
Cocoa Puffs

### Brief intro to Pandas
The code above works, but for this course we will use a library named Pandas to work with our CSV data. If you want to know more about the built-in `csv` module, check out [this link](https://docs.python.org/3/library/csv.html).
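
As a small taste of what the built-in module offers, `csv.DictReader` maps each row to the column names in the header. The sketch below uses two cereals and only the first four columns from `cereal.csv`, held in memory with `io.StringIO` so that it runs without the file.

```python
import csv
import io

# In-memory excerpt of cereal.csv (first four columns only)
sample = io.StringIO(
    'name,mfr,type,calories\n'
    '100% Bran,N,C,70\n'
    'All-Bran,K,C,70\n'
)

reader = csv.DictReader(sample)  # the default delimiter is a comma
rows = list(reader)

for row in rows:
    # Every value is still a string; convert with int() or float() as needed
    print(row['name'], row['calories'])
```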

Pandas is a library used for working with data. It can analyze, clean, explore, and manipulate data, and it makes reading datasets very easy.

In [None]:
# Run only once to install the Pandas library
# Run this only if you have not added pandas via pip to your environment
!pip install pandas

In [35]:
# import our newly installed library
import pandas as pd # pd is a short alias that we use later

data = pd.read_csv('cereal.csv') # pd shortname
data.head()

,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


The output above is much more readable than the raw `csv` output, just by using pandas. The `.head()` method shows the first 5 rows by default; to show a different number of rows, pass that number to `.head()`.

In [34]:
data.head(7)

,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
5,Apple Cinnamon Cheerios,G,C,110,2,2,180,1.5,10.5,10,70,25,1,1.0,0.75,29.509541
6,Apple Jacks,K,C,110,2,0,125,1.0,11.0,14,30,25,2,1.0,1.0,33.174094


In [37]:
# You can define an index column by using the index_col parameter
data = pd.read_csv('cereal.csv', index_col='name')
data.head()

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


To select a column, use the column name as an attribute of the DataFrame (the bracket form `data['calories']` is equivalent).

In [39]:
data.calories

name
100% Bran                     70
100% Natural Bran            120
All-Bran                      70
All-Bran with Extra Fiber     50
Almond Delight               110
                            ... 
Triples                      110
Trix                         110
Wheat Chex                   100
Wheaties                     100
Wheaties Honey Gold          110
Name: calories, Length: 77, dtype: int64

In [40]:
# What data type is this?
type(data)

pandas.core.frame.DataFrame

In [41]:
# And what data type is each of the columns?
type(data.calories)

pandas.core.series.Series

In [42]:
# You can also index the dataframe by position, similar to a list
data.iloc[0]

mfr                 N
type                C
calories           70
protein             4
fat                 1
sodium            130
fiber            10.0
carbo             5.0
sugars              6
potass            280
vitamins           25
shelf               3
weight            1.0
cups             0.33
rating      68.402973
Name: 100% Bran, dtype: object

In [44]:
data.iloc[15]

mfr                 R
type                C
calories          110
protein             2
fat                 0
sodium            280
fiber             0.0
carbo            22.0
sugars              3
potass             25
vitamins           25
shelf               1
weight            1.0
cups              1.0
rating      41.445019
Name: Corn Chex, dtype: object

In [45]:
# We can also index by index name
data[data.index == 'Trix']

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
Trix,G,C,110,1,1,140,0.0,13.0,12,25,25,2,1.0,1.0,27.753301
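
A label can also be looked up directly with `.loc`, the label-based counterpart of `.iloc`. Below is a minimal sketch using a three-row, four-column excerpt of the cereal data, held in memory so that it runs without `cereal.csv`.

```python
import io
import pandas as pd

# In-memory excerpt of cereal.csv (values copied from the outputs above)
sample = io.StringIO(
    'name,mfr,type,calories\n'
    '100% Bran,N,C,70\n'
    'Almond Delight,R,C,110\n'
    'Trix,G,C,110\n'
)

data = pd.read_csv(sample, index_col='name')

trix = data.loc['Trix']       # one row, returned as a Series
calories = data['calories']   # bracket form of data.calories

print(trix['mfr'])
print(calories['Almond Delight'])
```

Unlike the boolean filter `data[data.index == 'Trix']`, which returns a one-row DataFrame, `.loc['Trix']` returns the row itself as a Series when the label is unique.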
