flair with the icelandic_ner dataset #2114

Merged
merged 4 commits into from
Mar 9, 2021

Conversation

TatianaMoteuN
Contributor

No description provided.

Collaborator

@alanakbik alanakbik left a comment

Thanks for adding this! Unfortunately, this code does not compile for me. Some of the variables are undefined. Can you check and update?

data_folder = base_path / dataset_name

# download data if necessary
ZipFile.extractall(path=icelandic_ner, members="https://repository.clarin.is/repository/xmlui/handle/20.500.12537/42/allzip", pwd=None)
Collaborator

The variable icelandic_ner is not defined

Contributor Author

I changed this earlier this morning. Could you please check it again?
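
For context on the `ZipFile.extractall` call flagged in this thread: in the standard library it is an instance method on an opened archive, `path` is the destination directory, and `members` is a list of names inside the archive, so it cannot fetch anything from a URL. A minimal sketch, assuming the zip has already been downloaded to disk; the helper name `extract_archive`, the archive built here, and its `mbl.txt` member and tag values are stand-ins for illustration, not the real dataset:

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def extract_archive(zip_path: Path, target_dir: Path) -> List[str]:
    """Extract every member of a local zip archive into target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    # extractall is an instance method, so the archive must be opened first.
    with zipfile.ZipFile(zip_path) as archive:
        # `path` is the destination directory; `members` (omitted here) would
        # be a list of names inside the archive, never a download URL.
        archive.extractall(path=target_dir)
        return sorted(archive.namelist())


# Build a tiny stand-in archive so the sketch runs offline (hypothetical data).
workdir = Path(tempfile.mkdtemp())
demo_zip = workdir / "demo.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("fbl.txt", "Jón B-Person\n")
    archive.writestr("mbl.txt", "Reykjavík B-Location\n")

extracted = extract_archive(demo_zip, workdir / "icelandic_ner")
```

In the PR itself the download step would still come first; this only shows the extraction half.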

outfile.write(contents)

# download files if not present locally
cached_path(f"{icelandic_ner_path}ned.testa", data_folder / 'raw')
Collaborator

The variable icelandic_ner_path is not defined

Contributor Author

I changed this earlier this morning. Could you please check it again?


# we need to slightly modify the original files by adding some new lines after document separators
train_data_file = data_folder / 'train.txt'
if not train_data_file.is_file():
Collaborator

Why is this part necessary? Are extra offsets needed? Maybe you can use the files as they are?

Contributor Author

I changed this earlier this morning. Could you please check it again?

with open("icelandic_ner_path/train.txt", "w") as outfile:
# download zip
icelandic_ner ="https://repository.clarin.is/repository/xmlui/handle/20.500.12537/42/allzip"
icelandic_ner_path = cached_path(icelandic_ner, Path("datasets") / dataset_name)
Collaborator

There are indentation problems here, causing the program to break.

@TatianaMoteuN
Contributor Author

Hello @alanakbik,
could you please have a look at this code? I can't figure out what is wrong.

    # default dataset folder is the cache root

    if not base_path:
        base_path = Path(flair.cache_root) / "datasets"
    data_folder = base_path / dataset_name

    if not os.path.isfile(data_folder / 'icelandic_ner.txt'):
        # download zip
        icelandic_ner ="https://repository.clarin.is/repository/xmlui/handle/20.500.12537/42/allzip"
        icelandic_ner_path = cached_path(icelandic_ner, Path("datasets") / dataset_name)

        #unpacking the zip
        unpack_file(
              icelandic_ner_path,
              data_folder,
              mode="zip",
              keep=True
          )
    #merge the files in one as the zip is containing multiples files
    #entries = os.path.listfile(data_folder)

    with open("icelandic_ner.txt", "wb") as outfile:
        for files in os.walk(data_folder):
           # # print(files[2])
           #  files = glob.glob('*.txt')
            for filename in files[2]:
                if filename.endswith('.txt'):
                    with open(filename) as infile:
                        contents = infile.read()
                       # print(contents)
                        outfile.write(contents)

and the error:

/home/aimsgh/home/aimsgh/SCIoI/flair/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
Traceback (most recent call last):
  File "/home/aimsgh/SCIoI/flair/flair-1/load_dataset.py", line 3, in <module>
    icelandic_ner = ICELANDIC_NER()
  File "/home/aimsgh/SCIoI/flair/flair-1/flair/datasets/sequence_labeling.py", line 654, in __init__
    with open(filename) as infile:
FileNotFoundError: [Errno 2] No such file or directory: 'fbl.txt'

Process finished with exit code 1

@alanakbik
Collaborator

You are not specifying the correct path to the file: you are passing only the filename to the open method. You need to specify the full path to the file. The same applies to the outfile.
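
The fix described here can be sketched as follows. `os.walk` yields `(root, dirs, files)` tuples, so each bare filename has to be joined with its `root` before opening, and the merged file should be written inside `data_folder` rather than the current working directory. This is a minimal sketch, not the code that was merged: the helper name `merge_txt_files` and the demo data are hypothetical. It also opens both files in text mode, since writing a `str` to a file opened with `"wb"` raises a `TypeError`.

```python
import os
import tempfile
from pathlib import Path


def merge_txt_files(data_folder: Path, merged_name: str = "icelandic_ner.txt") -> Path:
    """Concatenate every .txt file under data_folder into one file in data_folder."""
    merged_path = data_folder / merged_name
    # Collect full paths first, skipping the merged file itself so that a
    # rerun never reads its own output as input.
    inputs = sorted(
        Path(root) / name
        for root, _dirs, names in os.walk(data_folder)
        for name in names
        if name.endswith(".txt") and Path(root) / name != merged_path
    )
    with open(merged_path, "w", encoding="utf-8") as outfile:
        for path in inputs:
            # `path` is root + filename, not the bare filename.
            with open(path, encoding="utf-8") as infile:
                outfile.write(infile.read())
    return merged_path


# Hypothetical stand-in data so the sketch runs without the real corpus.
data_folder = Path(tempfile.mkdtemp())
(data_folder / "fbl.txt").write_text("Jón B-Person\n\n", encoding="utf-8")
(data_folder / "mbl.txt").write_text("Reykjavík B-Location\n\n", encoding="utf-8")
merged = merge_txt_files(data_folder)
```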

@TatianaMoteuN
Contributor Author

TatianaMoteuN commented Mar 5, 2021 via email

Collaborator

@alanakbik alanakbik left a comment

Looks good, but please remove the local file and add the tag_to_bioes parameter.

load_dataset.py Outdated
@@ -0,0 +1,15 @@
from flair.datasets import ICELANDIC_NER
Collaborator

Please do not add local files to git!

data_folder,
columns,
train_file='icelandic_ner.txt',
in_memory=in_memory,
Collaborator

Can you also add the tag_to_bioes parameter here? I.e.:

        `tag_to_bioes=tag_to_bioes,`

Contributor Author

Done!
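
As background on the parameter requested in this thread: as far as can be told from flair's other dataset classes of that period, `tag_to_bioes` names the tag column whose labels should be converted to the BIOES scheme. The sketch below is not flair's implementation. It is a plain-Python illustration of what an IOB2-to-BIOES conversion does, with a hypothetical function name and made-up example tags:

```python
from typing import List


def iob2_to_bioes(tags: List[str]) -> List[str]:
    """Convert one sentence's IOB2 tags (B-/I-/O) to BIOES (adds S- and E-)."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The span continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # A B- tag with no continuation is a single-token span.
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:
            # An I- tag with no continuation closes the span.
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes


example = ["B-Person", "I-Person", "O", "B-Location"]
converted = iob2_to_bioes(example)
# converted == ["B-Person", "E-Person", "O", "S-Location"]
```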

@alanakbik
Collaborator

@TatianaMoteuN thanks for adding this!

@alanakbik alanakbik merged commit c17df92 into flairNLP:master Mar 9, 2021