## Chapter 5. From file to data frame and back
#### Notebook for Python

Van Atteveldt, W., Trilling, D. & Arcila, C. (2022). <a href="https://cssbook.net" target="_blank">Computational Analysis of Communication</a>. Wiley.

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ccs-amsterdam/ccsbook/blob/master/chapter05/chapter_05_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

In [1]:
!pip3 install pandas nltk scikit-learn



In [1]:
import json
import urllib.request
import pandas as pd
import nltk
from nltk.corpus import state_union
nltk.download('punkt')
from sklearn.datasets import fetch_20newsgroups

### Dataframes

In [2]:

# Create two lists that will be columns
list1 = ["Anna", "Peter", "Sarah", "Kees"]
list2 = [40, 33, 40, 77]

# or we could have a list of lists instead
mytable = [["Anna", 40],
           ["Peter", 33],
           ["Sarah", 40],
           ["Kees", 77]]

# Convert the list of lists to a data frame
df = pd.DataFrame(mytable)

# Or create the data frame directly from the two lists
df2 = pd.DataFrame.from_records(zip(list1, list2))

# No. of rows, no. of columns, and shape
print(f"{len(df)} rows x {len(df.columns)} cols")
print(f"Its shape is {df.shape}")

print("Element-wise equality of df and df2:")
print(df == df2)

4 rows x 2 cols
Its shape is (4, 2)
Element-wise equality of df and df2:
      0     1
0  True  True
1  True  True
2  True  True
3  True  True
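
The data frames above have the default column labels `0` and `1`. As a minimal sketch, columns can be named at creation time, either with a dict of lists or with the `columns` argument. The names `name` and `age` are assumptions made for illustration; the source does not say what the numbers represent.

```python
import pandas as pd

names = ["Anna", "Peter", "Sarah", "Kees"]
ages = [40, 33, 40, 77]

# A dict maps each (hypothetical) column name to its list of values
df_named = pd.DataFrame({"name": names, "age": ages})

# Equivalent: pass rows plus explicit column names
rows = [list(pair) for pair in zip(names, ages)]
df_rows = pd.DataFrame(rows, columns=["name", "age"])

# Both routes give the same data frame
print(df_named.equals(df_rows))

# Named columns make selections easier to read
print(df_named[df_named["age"] == 40]["name"].tolist())
```

With named columns, later code can refer to `df_named["age"]` rather than `df[1]`.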


In [4]:
url = "https://cssbook.net/d/media.csv"
# Read a CSV file directly from the internet
df = pd.read_csv(url)

# We can also explicitly specify delimiter etc.
df = pd.read_csv(url, delimiter = ",")
# Note: use help(pd.read_csv) to see all options

# Save dataframe to a csv:
df.to_csv("mynewcsvfile.csv")
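
By default, `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below illustrates this with a small made-up data frame and in-memory text instead of a file, so that it runs without touching the disk or the network.

```python
import io
import pandas as pd

# Small made-up data frame for the demonstration
df_demo = pd.DataFrame({"name": ["Anna", "Peter"], "age": [40, 33]})

# Without a path, to_csv returns the CSV text as a string
with_index = df_demo.to_csv()                # index written as first column
without_index = df_demo.to_csv(index=False)  # data columns only

reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))

print(list(reread_with.columns))     # ['Unnamed: 0', 'name', 'age']
print(list(reread_without.columns))  # ['name', 'age']
```

Passing `index=False` is usually what you want when the index is just the default row number.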

In [5]:
# Define stopword list in the code itself 
stopwords = ["and","or","a","an","the"]


# Better idea: Download stopwords file and read it
url = "https://cssbook.net/d/stopwords.txt"
urllib.request.urlretrieve(url, "stopwords.txt")
with open("stopwords.txt") as f:
    stopwords = [w.strip() for w in f] 
stopwords

['and', 'or', 'a', 'an', 'the']
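
To show what such a list is typically used for, here is a small sketch that reads a stopword file into a set and filters a tokenized sentence. The file contents and the example sentence are made up for this demonstration, and a temporary file stands in for the downloaded `stopwords.txt`.

```python
import tempfile
from pathlib import Path

# Stand-in for the downloaded file (contents invented for this demo)
path = Path(tempfile.mkdtemp()) / "stopwords_demo.txt"
path.write_text("and\nor\n\na\nan\nthe\n", encoding="utf-8")

with open(path, encoding="utf-8") as f:
    # strip() removes the trailing newline; the condition skips empty lines
    stopset = {line.strip() for line in f if line.strip()}

tokens = "the cat and the dog saw an owl".split()
content_words = [t for t in tokens if t not in stopset]
print(content_words)  # ['cat', 'dog', 'saw', 'owl']
```

A set is used rather than a list because membership tests (`t not in stopset`) stay fast even for long stopword lists.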

In [6]:
# Modify the stopword list and save it, one word per line
# (writelines does not add line breaks by itself):
stopwords += ["somenewstopword", "andanotherone"]
with open("newstopwords.txt", mode="w") as f:
    f.writelines(w + "\n" for w in stopwords)


    
# Use json to read/write dictionaries
somedict = {"label":"Report", 
            "entries":[1,2,3,4]}

with open("test.json",mode = "w") as f:
    json.dump(somedict, f)

with open("test.json",mode = "r") as f:
    d = json.load(f)
print(d)

{'label': 'Report', 'entries': [1, 2, 3, 4]}
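
A JSON round trip does not always return exactly the object that went in, because JSON has fewer data types than Python. The following sketch, using a made-up dictionary, shows two common cases: tuples come back as lists, and non-string keys come back as strings.

```python
import json

# Made-up example with a tuple value and an integer key
record = {"label": "Report", "entries": (1, 2, 3, 4), 2022: "year"}

# dumps/loads work like dump/load, but on strings instead of files
text = json.dumps(record, indent=2)
restored = json.loads(text)

print(restored["entries"])    # [1, 2, 3, 4]
print(list(restored.keys()))  # ['label', 'entries', '2022']
```

The `indent` argument only affects how the text is laid out, not the data itself.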


### Datasets

In [2]:
# Note: use fetch_20newsgroups? for more options
d = fetch_20newsgroups(
    remove=("headers", "footers", "quotes"))
# d["target"] holds one integer per document;
# d["target_names"] maps that integer to a group name
labels = [d["target_names"][t] for t in d["target"]]
df = pd.DataFrame(zip(d["data"], labels))
df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | I was wondering if anyone out there could enli... | rec.autos |
| 1 | A fair number of brave souls who upgraded thei... | comp.sys.mac.hardware |
| 2 | well folks, my mac plus finally gave up the gh... | comp.sys.mac.hardware |
| 3 | \nDo you have Weitek's address/phone number? ... | comp.graphics |
| 4 | From article <C5owCB.n3p@world.std.com>, by to... | sci.space |
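
In the object returned by `fetch_20newsgroups`, `target` contains one integer per document, while `target_names` contains each group name only once. A per-document label is therefore obtained by indexing `target_names` with each value in `target`. The sketch below shows this with a hypothetical miniature stand-in (three invented documents, two groups), so that it runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the object returned by fetch_20newsgroups
bunch = {
    "data": ["text about cars", "text about macs", "more about macs"],
    "target": [1, 0, 0],
    "target_names": ["comp.sys.mac.hardware", "rec.autos"],
}

# Look up the group name for each document's integer target
labels_demo = [bunch["target_names"][i] for i in bunch["target"]]
df_news = pd.DataFrame({"text": bunch["data"], "label": labels_demo})

# Number of documents per group
print(df_news["label"].value_counts())
```

Zipping `data` directly with `target_names` would stop after as many rows as there are group names and would pair texts with the wrong labels.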


In [3]:
# Note: download is only needed once...
nltk.download("state_union")
sentences = state_union.sents()
print(f"There are {len(sentences)} sentences.")

[nltk_data] Downloading package state_union to /home/wva/nltk_data...
[nltk_data]   Package state_union is already up-to-date!


There are 17930 sentences.
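
Each element of `sentences` is itself a list of tokens rather than a string. As a minimal sketch, the made-up stand-in below mimics that structure to show how to compute sentence lengths and turn a token list back into readable text without downloading the corpus.

```python
# Hypothetical stand-in for state_union.sents(): a list of token lists
sentences_demo = [
    ["We", "meet", "tonight", "."],
    ["The", "state", "of", "the", "union", "is", "strong", "."],
]

# Number of tokens per sentence
lengths = [len(s) for s in sentences_demo]
print(lengths)  # [4, 8]

# Join the tokens of the longest sentence back into a single string
longest = max(sentences_demo, key=len)
print(" ".join(longest))
```

Joining with a space leaves a space before punctuation, which is fine for inspection but not for publication-quality text.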
