No description, website, or topics provided.
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
__pycache__
.gitignore
Generic_Notebook_Norwegian_National_Library.ipynb
LICENSE.txt
README.md
STTR_example.ipynb
Wordcloud installer.ipynb
clustring_collocation-infrastruktur-Copy1.ipynb
clustring_collocation-infrastruktur.ipynb
clustring_collocation.ipynb
desktop.ini
karakternettverk – Kopi.ipynb
karakternettverk.ipynb
korpus_sammenlign_metadata.ipynb
module_update – Kopi.py
module_update.py
nbtext.py
ngram_sammenlign.ipynb
requirements.txt
vekstdiagram.ipynb
wildcard_search.ipynb

README.md

NB_API_Python

A repository of Jupyter notebooks for accessing the full text API towards the Norwegian National Library, described here. Documentation of the functions can be found in the wiki Here is link to the binder.org version

Here is a short description of each function, which can be used as commands in a notebook. The source code is in file nbtext.py, have a look at that to see how API commands and URLs are used.

Metadata and URNs

totals(top=200)

Returns a dictionary of the top ´top´ words in the digital collection

urn_from_text(T)

Return a list of URNs as 13 digit serial numbers.

metadata(urn="""text""")

Returns a list of metada entries for given URN.

pure_urn(data)

Convert URN-lists with extra data into list of serial numbers. Used to convert different ways of presenting URNs into a list of serial decimal digits. Designed to work with book URNs, and will not work for newspaper URNs.

    Args:
        data: May be a list of URNs, a list of lists with URNs as their
            initial element, or a string of raw texts containing URNs
    Returns:
        List[str]: A list of URNs. Empty list if input is on the wrong
            format or contains no URNs

col_agg(df, col='sum')

Aggregate columns of a panda dataframe.

row_agg(df, col='sum')

Aggregate rows of a panda dataframe.

Access texts as frequency lists

get_freq(urn, top=50, cutoff=3)

Get frequency list of words for a given URN (as a serial number).

get_urn(metadata=None)

Get URNSs from metadata specified as a dictionary. Keys specified in quotes are:

  • "corpus": "avis" or "bok"
  • "author": wildcard match using % as wildcard.
  • "title": wildcard match using %. For newspapers this corresponds to name of paper.
  • "year": starting year as number or number as string.
  • "next": the next number of years starting from ´year´
  • "ddk": Dewy decimal number as wildcard match e.g. "64%"
  • "gender": value is "m" for male or "f" for female
  • "subject": keywords used to annotate text in the national bibliography.

get_corpus_text(urns, top = 10000, cutoff=5)

From a corpus as a list of URNs, get the top top words that have a frequency above cutoff. Builds on top of get_freq. Returns a dataframe with URNs as row headers and words as indices. k = dict() for u in urns: k[u] = get_freq(u, top = top, cutoff = cutoff) return pd.DataFrame(k)

get_papers(top=5, cutoff=5, navn='%', yearfrom=1800, yearto=2020, samplesize=100)

Get newspapers as frequency lists. Parameter top asks for the top ranked words, cutoff indicates the lower frequency limit, while navn indicates newspaper name as wildcard string.

Collocations and clusters

make_a_collocation(word, period=(1990, 2000), before=5, after=5, corpus='avis', samplesize=100, limit=2000)

Return a collocation as dataframe.

compute_assoc(coll_frame, column, exponent=1.1, refcolumn = 'reference_corpus')

Compute an association using PMI. return pd.DataFrame(coll_frame[column]**exponent/coll_frame.mean(axis=1))

collocation(word, yearfrom=2010, yearto=2018, before=3, after=3, limit=1000, corpus='avis')

Compute a collocation for a given word within indicated period. before is the number of preceeding words, after number of words following, limit. data = requests.get( "https://api.nb.no/ngram/collocation", params={ 'word':word, 'corpus':corpus, 'yearfrom':yearfrom, 'before':before, 'after':after, 'limit':limit, 'yearto':yearto}).json() return pd.DataFrame.from_dict(data['freq'], orient='index')

normalize_corpus_dataframe(df)

Normalized all values in corpus df as a dataframe. Changes df in situ, and returns True.

show_korpus(korpus, start=0, size=4, vstart=0, vsize=20, sortby = '')

Show part of a dataframe korpus, slicing along columns starting from startand numbers by size and slicing rows by vstartand vsize. Sorts by first column by default.

aggregate(korpus)

Make an aggregated sum of all documents across the corpus, here we use average return pd.DataFrame(korpus.fillna(0).mean(axis=1))

convert_list_of_freqs_to_dataframe(referanse)

The function get_papers() returns a list of frequencies - convert it and normalize.

get_corpus(top=5, cutoff=5, navn='%', corpus='avis', yearfrom=1800, yearto=2020, samplesize=10)

First version of collecting a corpus using parameters described above for get_papers (for newspapers) and get_corpus (for books).

Classes

class Cluster

def __init__(self, word = '', filename = '', period = (1950,1960) , before = 5, after = 5, corpus='avis', reference = 200, 
             word_samples=1000):

See clustering notebook for example and closer description.

class Corpus

See Corpus notebook for examples and explanation.

Corpus_urn

See example notebook

Graphs and network analysis

make_newspaper_network(key, wordbag, titel='%', yearfrom='1980', yearto='1990', limit=500)

Seems not to work at the moment.

make_network(urn, wordbag, cutoff=0)

Make a graph as networkx object from wordbag and urn. Two words are connected if they occur within same paragraph.

make_network_graph(urn, wordbag, cutoff=0)

Make a graph as networkx object from wordbag and urn. Two words are connected if they occur within same paragraph.

draw_graph_centrality(G, h=15, v=10, fontsize=20, k=0.2, arrows=False, font_color='black', threshold=0.01)

Draw a graph using force atlas.

Wordclouds

make_cloud(json_text, top=100, background='white', stretch=lambda x: 2**(10*x), width=500, height=500, font_path=None)*

Create a word cloud from a frequency list. First line of code: pairs0 = Counter(json_text).most_common(top)

draw_cloud(sky, width=20, height=20, fil='')

Draw a word cloud produces by make_cloud

cloud(pd, column='', top=200, width=1000, height=1000, background='black', file='', stretch=10, font_path=None)

Make and draw a cloud from a pandas dataframe, using make_cloud and draw_cloud.

Growth diagrams (sentiment analysis)

vekstdiagram(urn, params=None)

Make a growth diagram for a given book using a set of words: Parameters

'words': list of words 'window': chunk size in the book 'pr': how many words are skipped before next chunk

plot_sammen_vekst(urn, ordlister, window=5000, pr = 100)

For ploting more than one growth diagram. Have a look at example notebook.

Word relations and n-grams

difference(first, second, rf, rs, years=(1980, 2000),smooth=1, corpus='bok')

Compute difference of difference (first/second)/(rf/rs) for ngrams.

relaterte_ord(word, number = 20, score=False)

Find related words using eigenvector centrality from networkx. Related words are taken from NB Ngram. Note: Works for english and german - add parameter!!

nb_ngram(terms, corpus='bok', smooth=3, years=(1810, 2010), mode='relative')

Collect an ngram as json object from NB Ngram. Terms is string of comma separated ngrams (single words up to trigrams).

ngram_conv(ngrams, smooth=1, years=(1810,2013), mode='relative')

Convert ngrams to a dataframe.

make_graph(word)

Get graph like in NB Ngram

Concordances

get_konk(word, params=None, kind='html')

Get a concordance for given word. Params are like get_urn. Value is either an HTML-page, a json structure, or a dataframe. Specify kind as 'html', 'json' or '' respectively.

get_urnkonk(word, params=None, html=True)

Same as get_konk but from a list of URNs.

Character Analysis and graphs

central_characters(graph, n=10)

wrapper around networkx res = Counter(nx.degree_centrality(graph)).most_common(n) return res

central_betweenness_characters(graph, n=10)

wrapper around networkx res = Counter(nx.betweenness_centrality(graph)).most_common(n) return res

check_words(urn, ordbag)

Find frequency of words in ordbag within a book given by urn.

Text complexity

sttr(urn, chunk=5000)

Compute a standardized type/token-ratio for text identified with urn. The function expects the serial number of a URN for a book. Returns a number.

navn(urn)

Returns a dictionary of frequency of possible names from URN as serial number.

Utilities

heatmap(df, color='green')

A wrapper for heatmap of df as a Pandas dataframe, like this: return df.fillna(0).style.background_gradient(cmap=sns.light_palette(color, as_cmap=True))

df_combine(array_df)

Combine one column dataframes into one dataframe.

wildcardsearch(params=None)

Default values: params = {'word': '', 'freq_lim': 50, 'limit': 50, 'factor': 2} Returns a dataframe containing matches for word. See examples in notebook wildcardsearch.

sorted_wildcardsearch(params)

Same as wildcardsearch with results sorted on frequency.

combine(clusters)

Make new collocation analyses from data in clusters

cluster_join(cluster)

Used with serial clusters. Join them together in one dataframe. See example in cluster notebook.

serie_cluster(word, startår, sluttår, inkrement, before=5, after=5, reference=150, word_samples=500)

Make a series of clusters.

save_serie_cluster(tidscluster)

Save series to files.

les_serie_cluster(word, startår, sluttår, inkrement)

Read them

frame(something, name)

Create a dataframe of something

get_urns_from_docx(document)

A file in docx format may contain a list of URNs.

get_urns_from_text(document)

URNs from a .txt-document.

get_urns_from_files(mappe, file_type='txt')

Extract URNs from a folder with .txt and .docs files. Returns a dictionary with filenames as keys, each with a list of URNs.

check_vals(korpus, vals)

A wrapper for dataframes: return korpus[korpus.index.isin(vals)].sort_values(by=0, ascending=False)